Session 1:

Sensors  and  Sensor  Fusion 


Physiological  Sensors  for  Speech  Recognition 

Mike  Scanlon,  Francis  Fisher,  Steve  Chen 


Abstract.  Systems  designers  are  expressing  greater 
interest  in  speech-based  user  interfaces  for  a  variety  of 
civilian  and  military  applications.  Such  interfaces  provide 
hands-free  operation  and  a  more  natural  way  for  humans 
to  interact  with  systems.  One  difficulty  with  speech-based 
user  interfaces  is  poor  operation  in  noisy  environments 
such  as  military  operations.  The  Physiological  Sensor, 
developed  at  ARL,  is  an  example  of  an  alternative  sensor 
for  automatic  speech  recognition.  This  sensor  detects 
speech  by  measuring  acoustic  signals  through  the 
speaker’s  skin.  While  the  signal  produced  is  not  typical  of 
that  from  an  airborne  acoustic  microphone,  the  possibility 
exists  for  using  this  sensor  as  a  microphone.  We 
investigate  several  possible  methods  for  using  the 
Physiological  Sensor  as  a  microphone  for  automatic 
speech  recognition. 

1.  Introduction 

With  recent  advances  in  automatic  speech  recognition 
(ASR)  technology  has  come  an  increased  interest  in 
applying  this  technology  to  the  design  of  user  interfaces. 
For  a  system  being  operated  in  a  benign  environment  such 
an  interface  can  be  based  on  commercial  or  custom 
software  and  an  airborne  acoustic  microphone.  However, 
most  systems  of  this  type  are  difficult  or  impossible  to  use 
in  noisy  environments  such  as  those  presented  in  military 
or  industrial  scenarios.  In  such  cases  we  must  find 
alternative  ASR  software  or  speech  sensors  in  order  to 
enhance  operation  in  these  environments.  Efforts  to 
improve  operation  in  noisy  environments  by  removing  the 
noise  from  the  microphone  output  have  proven  difficult 
without  knowledge  of  the  external  noise  source. 

The  Physiological  Sensor,  a  medical  sensor  developed  at 
Army  Research  Laboratory,  is  a  device  that  physically 
couples  to  a  patient  to  record  medical  information  such  as 
respiration  and  heartbeat.  With  some  slight  modifications 
to  the  electronics,  ARL  has  converted  this  sensor  to  a 
microphone  to  be  worn  around  the  throat. 

2.  Physiological  Sensor  -  Background 

ARL  has  developed  a  new  method  to  measure  human 
physiology  and  monitor  health  and  performance 
parameters.  This  consists  of  an  acoustic  sensor  positioned 
inside  a  fluid-filled  bladder  in  contact  with  the  human 
body.  Packaging  the  sensor  in  this  manner  minimizes 


outside  environmental  interferences,  and  signals  within 
the  body  are  transmitted  to  the  sensor  bladder  with 
minimal  losses.  This  fluid-coupling  technology 
comfortably conforms to the human body and enhances
the signal-to-noise ratio (SNR) of physiological signals
relative to ambient noise. An acoustic sensor system can
detect  changes  in  a  person’s  physiological  status  resulting 
from  exertion  or  injuries  such  as  trauma,  penetrating 
wound,  hypothermia,  dehydration,  heat  stress,  and  many 
other  conditions  (or  illnesses).  Furthermore,  a  sensor 
contacting  the  torso,  head,  or  throat  region  picks  up  the 
wearer's  voice  very  well  through  the  flesh,  with  fidelity 
sufficient to be used as an auxiliary microphone for
communications or as a hands-free voice activation
mechanism. Automatic speech recognition software, in
conjunction  with  this  enhanced  body-coupling  sensor, 
could  improve  mission  performance  by  reducing  false 
voice  commands  through  improved  SNR  in  noisy 
environments.  Civilian  technology  transfer  applications 
include  clinical  surveillance,  medical  transport,  hospitals, 
and  telemedicine  applications.  Fire,  rescue,  and  police 
personnel  may  benefit  from  hands  free  voice 
communications  with  embedded  health  and  performance 
monitoring  [Scanlon,  patents]. 

2.1  Sensor  Description 

The neck-band sensor shown in figures 1 and 2 consists
of a housing, a gel-coupling sack with the sensor embedded
within, a neck strap, a preamplifier, and a battery pack with
hardwired signal egress and a push-to-talk button. The
headband  sensor  in  figure  3  does  not  use  a  liquid 
coupling,  but  rather  an  acoustically  conductive  silicone 
rubber. 

Data  were  collected  at  the  side  of  the  neck  using  a  sensor 
of  similar  geometry  to  the  sensor  in  figure  1  [Scanlon, 
1998].  The  test  included  a  spoken  word  count  from  1  to 
10, then mouth breathing for the remainder of the data set.
Naturally,  the  heartbeat  is  always  present.  The  time  and 
frequency  representations  are  shown  in  figure  4.  Figure  5 
compares  data  from  a  B&K  microphone  in  front  of  the 
speaker’s  mouth  to  that  of  a  fluid-coupled  physiological 
sensor  held  in  contact  with  the  neck  by  a  strap.  Data  from 
both  locations  were  taken  simultaneously  in  a  typical 
office  environment.  Comparing  the  amplitudes  of  the 
voice  to  the  non-vocal  ambient  noise  surrounding  the 
voice  gives  approximately  40  dB  SNR  for  the  B&K 
airborne  microphone,  and  approximately  75  dB  SNR  for 
the  fluid-coupled  sensor.  The  fluid  coupling  represents  an 
improvement of better than 30 dB in speech SNR with
minimal  waveform  degradation,  as  observed  by  the 
similarity  of  spectrograms  and  by  listening  to  the  data 
through  headphones. 
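The SNR comparison above can be illustrated with a minimal sketch. It assumes SNR is estimated as the ratio of the RMS amplitude of a voiced segment to that of a noise-only segment; the function names `rms` and `snr_db` and the synthetic signals are illustrative and are not taken from the original measurement procedure.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech_segment, noise_segment):
    """SNR in dB from the RMS of a voiced segment and a noise-only segment."""
    return 20.0 * math.log10(rms(speech_segment) / rms(noise_segment))

# Synthetic check: a tone 100 times larger than the noise floor is 40 dB above it.
tone = [math.sin(2 * math.pi * 0.05 * n) for n in range(1000)]
floor = [0.01 * s for s in tone]
print(round(snr_db(tone, floor), 1))  # 40.0

# Difference between the two approximate figures quoted in the text.
print(75 - 40)  # 35, consistent with "better than 30 dB"
```

Because a voiced segment also contains the ambient noise, this estimate is strictly a (speech + noise)-to-noise ratio; at the levels quoted here the difference is negligible.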


The  ability  of  body-coupled  sensors  to  detect 
physiology  and  reduce  background  noise  was 
investigated.  A  physiological  sensor  was  attached  to 
one  side  of  a  speaker’s  neck,  and  an  omnidirectional 
electret microphone was placed in front of the mouth.
Figures  6  and  7  show  simultaneously  collected  breath 
and  voice  data  before,  during,  and  after  a  speaking 
subject  is  immersed  in  a  C-weighted  noise  field  of 
105 dB (referenced to 20 micropascals).

The person wearing the sensors repeatedly vocalized
a 1 to 10 count from 14 to 19 s, 25 to 33 s,
65 to 71 s, and 71 to 77 s, and vocalized
“105 dB” between 47 and 50 s.

The  boom  microphone  in  figure  6  does  not  detect  any 
voice during the high amplitude noise between 20
and 71 s. However, in figure 7, the counting is
clearly visible throughout the loud noise with
the body-coupled gel sensor. Playing the collected data through headsets,
the  listener  could  clearly  hear  and  understand  the 
spoken  words  from  the  gel  sensor  in  105  dB  noise, 
but  could  not  discern  the  presence  of  any  speech  in 
the  boom  microphone  data. 

3.  Automatic  Speech  Recognition  Using  the 
Physiological  Sensor 

Army Research Laboratory (ARL) and Rockwell Science
Center  (RSC)  have  developed  several  experimental 
systems  that  use  the  Physiological  Sensor  as  input  to 
automatic  speech  recognition  (ASR)  systems.  These 
efforts  are  discussed  below. 

3.1  RSC  Integration  &  application  of  the 
Physiological  Sensor 

3.1.1  General  Signal  Characteristics  of  the 
Physiological  Sensor 

By  coupling  directly  to  the  user’s  neck,  the  physiological 
sensor  was  able  to  achieve  extraordinary  signal  to  noise 
performance  as  compared  to  airborne  acoustic 
microphone  technologies.  While  providing  significant 
rejection  of  ambient  noise,  the  sensor  was  not  entirely 
immune  to  ambient  sound.  For  instance,  it  was  quite 
possible  to  detect  other  persons  speaking  to  the  wearer  of 


the  physiological  sensor,  though  at  greatly  attenuated 
levels.  Due  largely  to  the  method  of  transduction,  the 
output  signal  of  the  ARL  physiological  sensor  was 
significantly  different  from  typical  acoustic  microphone 
signals.  Specifically,  higher  frequencies  tended  to  be 
significantly  attenuated.  Human  listeners  listening  to  the 
output  signal  of  the  physiological  sensor  indicated  that  the 
distortion  was  analogous  to  listening  to  a  person  in 
another  room  through  a  wall. 

3.1.2  Physiological  Sensor  and  Speech 
Recognizers 

Because  of  the  inherent  distortions  of  speech  associated 
with  the  ARL  physiological  sensor,  many  commercial, 
off-the-shelf  ASR  technologies,  like  IBM’s  ViaVoice, 
were  unable  to  successfully  recognize  speech  using  the 
physiological  sensor  signals.  Such  recognizers  often  rely 
on  Hidden-Markov  Models  of  speech,  where  the  models 
are  pre-estimated  using  statistical  methods  and  large 
databases  of  human  speech.  Such  databases  would  have 
been  collected  with  conventional  airborne  acoustic 
microphones,  so  any  speaker-independent  speech 
recognizer  would  have  an  inherent  expectation  about  the 
signal  characteristics  of  speech  as  normally  acquired 
through  airborne  acoustic  microphones.  Hence,  in 
performing  speech  recognition  with  the  physiological 
sensor,  speaker-dependent  recognizers  tended  to  work 
more  reliably.  As  recommended  by  ARL,  the  initial 
speech  recognition  engine  utilized  was  the  Clamor  engine, 
a  dynamic-time-warping  speech  recognizer  developed  by 
the  Lexicus  business  unit  of  Motorola.  Clamor  recorded 
templates  of  each  word  or  phrase  (“token”)  to  be 
recognized  as  provided  by  the  user  (2  instances  of  each 
token  were  kept  as  matching  templates).  Performance 
with  the  Clamor  recognition  engine  was  adequate  for 
discrete,  speaker  dependent  recognition  of  up  to  several 
distinct  tokens. 

Later,  Rockwell  Science  Center  developed  a  speaker- 
dependent,  Hidden-Markov  Model  based  discrete  speech 
recognizer  for  use  with  the  physiological  sensor.  The 
HMM-based  recognizer  was  designed  using  HTK,  a 
product  of  the  former  Entropic  Research  Laboratories. 
Like  the  DTW-based  Clamor  recognizer,  RSC’s  HMM- 
based  recognizer  provided  discrete  recognition  for  up  to 
several  distinct  tokens.  The  key  difference  was  that  with 
an  HMM-based  recognizer,  additional  training  samples 
could  be  used  to  re-estimate  the  speech  models,  and 
presumably  build  a  more  robust,  statistically  accurate 
model  of  each  token  as  more  and  more  training  utterances 
were  collected  from  the  user.  The  refined  HMM  models 
should  perform  better,  while  still  maintaining  the  same 
level  of  computational  complexity.  With  the  DTW 
approach,  the  use  of  additional  user  utterances  for 
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recognizer  training  would  necessarily  increase  the 
computational  burden  of  speech  recognition  at  runtime  - 
the  more  templates  that  were  collected,  the  longer  each 
match  would  take. 
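The runtime cost of template matching can be seen in a minimal dynamic-time-warping sketch. This is not the Clamor engine: the scalar "features", the token names, and the `recognize` helper are hypothetical, and a real recognizer would compare sequences of spectral feature vectors. The point is only that every stored template costs one full DTW pass per utterance.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(utterance, templates):
    """Return the token with the closest stored template (one DTW pass per template)."""
    best_token, best_dist = None, float("inf")
    for token, examples in templates.items():
        for example in examples:
            dist = dtw_distance(utterance, example)
            if dist < best_dist:
                best_token, best_dist = token, dist
    return best_token

# Two templates per token, mirroring the two stored instances described above.
templates = {
    "up": [[1, 2, 3, 4], [1, 2, 2, 3, 4]],
    "down": [[4, 3, 2, 1], [4, 3, 3, 2, 1]],
}
print(recognize([1, 1, 2, 3, 4, 4], templates))  # up
```

Adding a third or fourth template per token adds one more inner-loop iteration per token at recognition time, whereas re-estimating an HMM from extra utterances changes the model parameters but not the number of models evaluated.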

In  order  to  support  rapid  integration  and  testing  of  user 
interfaces  involving  the  physiological  sensor,  it  was 
integrated  with  Rockwell’s  Automatic  Speech 
Recognition  (ASR)  Server  technology.  The  ASR  Server 
provided  abstraction  of  an  encapsulated  speech 
recognition  engine  (Clamor  was  used  for  the 
physiological  sensor)  through  a  platform-neutral  TCP/IP 
socket  interface.  Applications  could  be  quickly  designed 
to  exploit  speech  recognition  services  of  the  ASR  Server 
through  a  simplified  protocol.  The  ASR  Server  could,  in 
turn,  provide  speech  recognition  through  either  the 
physiological  sensor,  or  a  conventional  acoustic 
microphone.  The  physiological  sensor  was  demonstrated 
in  conjunction  with  Rockwell’s  Multimodal  Integrated 
Displays  Testbed  in  early  1999. 

In  early  2000,  RSC’s  HMM-based  recognizer  for  the 
physiological  sensor  was  integrated  with  RSC’s  Bimodal 
ASR  Server.  The  Bimodal  ASR  Server  employed  a 
subset  of  the  same  client/server  interface  protocol  used  by 
the  ASR  Server;  whereas  the  ASR  Server  encapsulated 
COTS  acoustic  speech  recognizers,  the  Bimodal  ASR 
Server  encapsulated  more  experimental  recognition 
technologies, including both the HMM-based recognizer
for the physiological sensor and the visual lip-tracking
based speech recognizer described elsewhere
in this text. The physiological sensor and Bimodal ASR
Server  were  demonstrated  as  components  of  Rockwell’s 
Integrated  Displays  Testbed  v2  in  early  2000  [Vassiliou, 
00].  As  part  of  the  demonstration,  a  user  could 
dynamically  switch  between  speech  recognition  using 
either  the  lip  tracker  or  the  physiological  sensor. 

The  natural  extension  of  this  work  would  be  development 
of  a  hybrid  speech  recognition  technology  that 
concurrently  uses  both  the  physiological  sensor  and  the 
visual  speech  recognizer.  The  two  technologies  are 
uniquely  complementary  because  while  the  visual  speech 
recognizer  leverages  key  visible  features  of  speech 
articulator  motion  (vital  for  recognition  of  consonant 
sounds), it is unable to distinguish voiced from
unvoiced  speech,  and  indeed  is  fairly  unsuitable  for 
discrimination  of  vowel  sounds  from  one  another.  On  the 
other  hand,  because  of  its  nearly  direct  coupling  to  the 
vocal  tract,  the  physiological  sensor  is  advantageously 
placed  for  detecting  voicing  and  discriminating  vowel 
sounds,  while  its  ability  to  capture  subtle  acoustic 
transients  of  consonant  production  may  be  compromised 
by  its  body-coupled  nature.  The  visual  speech  recognizer 
is  already  HMM  based,  so  significant  research 
opportunities  exist  for  the  development  of  appropriate 


feature  vectors  and  HMM  topologies  to  integrate  the  two 
distinct  signal  streams  (visual  &  acoustic). 

3.1.3  Ergonomics 

The  physiological  sensor  was  found  to  be  generally 
comfortable  to  wear,  though  there  were  some  issues  with 
the  design.  One  obvious  problem  was  that  users  wearing 
a  collared  dress  shirt  could  have  problems  fitting  the 
physiological  sensor  band  either  above  or  under  the 
collar.  Generally,  with  a  shirt  collar  closed,  fitting  the 
physiological  sensor  inside  the  collar  band  was  not 
practical.  Wearing  the  physiological  sensor  higher  on  the 
neck  than  a  closed  shirt  collar  tended  to  limit  head 
movement.  Possibly,  a  narrower  band  and  smaller  sensor 
capsule  could  help  with  these  issues. 

The  neck  band  itself  was  fairly  easy  to  secure  due  to  the 
use  of  Velcro  surfaces.  The  fabric  of  the  neck  band  was 
of  a  dense  weave,  which  could  lead  to  the  accumulation 
of  perspiration  under  the  neck  band  under  some 
conditions.  A  thinner,  more  loosely  woven  fabric, 
perhaps  an  elastic  one,  might  be  helpful. 

The  physiological  sensor  was  also  compared  to  a  similar 
COTS throat-worn microphone product, the LASH II
microphone  distributed  by  Television  Equipment 
Associates.  While  the  LASH  II  did  use  a  thinner, 
narrower,  elastic  collar  band,  the  plastic  hook  assembly 
for  closing  and  securing  the  LASH  II  was  not  as  easy  to 
use  as  the  Velcro  design  of  the  ARL  physiological  sensor. 
Further,  the  LASH  II  design  caused  two  rigid  plastic 
nodes  to  be  pressed  against  the  user’s  throat,  which  could 
cause  significant  discomfort  when  worn  over  extended 
periods.  In  contrast,  wearers  generally  did  not  find  the 
ARL  physiological  sensor  to  increase  in  discomfort  over 
time. 

Some hesitance and psychological resistance to wearing
the physiological sensor were also reported by prospective
users. An obvious safety concern for any neck-worn
apparatus  is  the  possibility  of  choking,  either  by  accident 
or  by  assailants.  Also,  while  head  worn  microphones  of 
some  styles  have  come  to  be  socially  acceptable  to 
wearers  and  even  fashionable  or  “cool”  in  certain 
contexts,  the  visual  appearance  of  the  neck  worn 
physiological  sensor  was  less  acceptable  to  some  users. 

3.1.4  Physiological  Sensor  Integration  Issues 

In  early  1999,  Rockwell  received  first  samples  of  the 
ARL  Physiological  Sensor  technology.  Early  samples 
used  a  fairly  large  (~5”x3”x2”)  preamplification  module, 
which  was  rather  bulky  and  not  well  suited  to  bodyworn 
applications.  Despite  having  a  full  metal  casing,  the 
combination of physiological sensor and preamplification
module  was  also  susceptible  to  grounding  problems, 
which  would  cause  a  strong  60Hz  hum  to  be  present  in 
the  output  signal.  The  grounding  problems  were 
corrected  in  the  next  received  prototype  early  in  1999  and 
the  physiological  sensor  was  successfully  mated  to  a  PC- 
based  sound  card  using  the  line  level  input.  Some  speech 
recognizers  are  designed  with  the  assumption  that  the 
microphone  input  of  a  sound  card  will  be  used  for  speech 
acquisition, so the use of line level input could have been
an  integration  issue  for  some  speech  recognition 
technologies. 

Newer  versions  of  the  physiological  sensor  supplied  by 
ARL  in  late  1999  and  early  2000  used  a  much  smaller  and 
lighter preamplification module (~2”x1”x0.5”) in a plastic
rather  than  a  metal  housing.  The  new  preamplification 
module  was  light  enough  to  be  carried  with  the  user,  and 
the  signal  level  was  suitable  for  use  with  the  microphone 
inputs  of  typical  PC  sound  cards.  It  also  included  a 
momentary  push-to-talk  switch.  Conceptually,  a  push-to- 
talk  switch  is  helpful  in  speech  recognition  applications 
because  if  the  press  and  release  events  for  the  switch  can 
be  detected  by  the  speech  recognizer,  then  delimitation  of 
user  utterances  becomes  fairly  easy.  Also,  the  use  of  a 
push-to-talk  switch  helps  to  prevent  false  recognition 
(insertion)  errors  where  extraneous  noises  or  speech  not 
intended  for  the  recognizer  are  acquired  by  the  transducer. 
In  the  case  of  the  current  versions  of  the  physiological 
sensor  though,  the  implementation  of  the  push-to-talk 
switch  is  suboptimal  for  speech  recognition.  First,  the 
switch  is  electromechanical  and  entirely  embedded  in  the 
preamp  module  of  the  physiological  sensor,  so  there  is  not 
a  deterministic  way  (e.g.  additional  connector  pin)  for  an 
attached  device  or  computer  to  ascertain  when  the  switch 
is  pressed  and  released.  The  switch  also  induces 
significant  transients  in  the  sensor’s  output  signal  when  it 
is  pressed  and  released.  Such  transients  in  the  speech 
signal  are  apt  to  confuse  most  existing  speech  recognition 
technologies.  The  workaround  solution  employed  to 
address  these  issues  was  to  keep  the  push-to-talk  switch 
depressed  at  all  times  while  using  a  speech  recognition 
system,  and  to  rely  on  other,  external  push-to-talk  switch 
mechanisms  that  were  more  readily  tracked  by  the 
Rockwell  ASR  Server.  Additionally,  because  the  push- 
to-talk  switch  was  of  a  momentary-on  design,  additional 
external  fixtures  were  required  to  keep  the  switch 
depressed. 
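The utterance delimitation described above can be sketched as follows, assuming an external push-to-talk switch whose press and release times are available to the recognizer. The function `extract_utterance` and the 50 ms guard interval are hypothetical; the guard simply illustrates one way to keep switch transients out of the audio passed to a recognizer.

```python
def extract_utterance(samples, sample_rate, press_time, release_time, guard=0.05):
    """Slice the samples between switch press and release.

    A guard interval (in seconds, an assumed value) is skipped at each end
    so that any switch transients fall outside the returned segment.
    """
    start = int(round((press_time + guard) * sample_rate))
    stop = int(round((release_time - guard) * sample_rate))
    start = max(start, 0)
    stop = min(stop, len(samples))
    if stop <= start:
        return []
    return samples[start:stop]

# 2 s of dummy audio at 8 kHz; switch held from 0.5 s to 1.5 s.
audio = list(range(16000))
utterance = extract_utterance(audio, 8000, 0.5, 1.5)
print(len(utterance), utterance[0])  # 7200 4400
```

With the embedded switch, no press or release timestamps are available, which is why this kind of slicing was only possible with an externally tracked switch.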

For  some  applications,  it  was  desirable  for  the  user  of  the 
physiological  sensor  to  be  free  to  move  about  untethered. 
Attempts  were  made  to  connect  the  physiological  sensor 
to  a  wireless  microphone  transmitter  module  (Audio 
Technica  ATW-T75),  but  the  output  signal  levels  and 
impedance  were  found  to  be  not  fully  compatible  with  the 
input  stages  available  on  the  wireless  transmitter. 


Although  a  signal  could  be  sent  wirelessly,  additional 
distortions  were  introduced,  which  ultimately  degraded 
speech  recognition  accuracy. 

RSC  has  provided  ARL  with  recommendations  for 
improvements  to  the  design  of  future  Physiological 
Sensor  based  microphones. 

3.2  Army  Research  Laboratory 

ARL  has  conducted  two  experiments  using  the 
Physiological  Sensor  as  an  input  device  for  ASR.  The  first 
effort  used  the  Entropic  HTK  as  the  automatic  speech 
recognition  (ASR)  engine  and  compared  the  capabilities 
of  the  Physiological  Sensor  with  an  acoustic  microphone. 
The  second  effort  utilized  Dragon  Systems  Naturally 
Speaking,  a  commercial  ASR  product  to  evaluate  the 
possibility  of  using  the  Physiological  Sensor  with 
commercial  speech  engines. 

All  applications  of  the  Physiological  Sensor  as  a  speech 
input  device  must  take  into  account  the  difference  in 
frequency  response  of  this  sensor  as  compared  to  a  typical 
airborne  acoustic  microphone.  This  difference  in 
frequency  response  typically  precludes  the  use  of  acoustic 
language  models  provided  with  most  ASR  systems. 

3.2.1  Physiological  Sensor  with  Entropic  HTK 

For  the  experiment  using  HTK,  ARL  teamed  with  the 
United  States  Military  Academy  (USMA)  to  develop 
speech  models  appropriate  for  use  with  the  Physiological 
Sensor [Bass, 99]. The Entropic HTK, a Hidden Markov
Model  based  system,  was  chosen  because  it  provides  the 
flexibility  required  to  adapt  the  internal  configuration  of 
the  ASR  engine  for  use  with  the  Physiological  Sensor. 

The  test  consisted  of  trying  to  recognize  one  of  50  phrases 
using  both  an  airborne  sensor  (microphone)  and  the 
Physiological  Sensor.  Two  recognizers  were  used,  each 
trained  on  one  of  the  sensors  being  tested.  Phrases 
consisted  of  two  to  ten  words  each,  with  a  total  of  153 
unique  words.  Each  test  subject  spoke  the  phrases  in  an 
environment that yielded speech-to-noise ratios of 0, 3,
and 10 dB through the airborne sensor, while
wearing  both  the  airborne  and  physiological  sensors. 
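For readers who want to approximate these conditions with recorded data, the following sketch computes the gain to apply to a noise recording so that a digital mix reaches a target SNR. This is an illustration only: the original tests used an acoustic noise environment, and the function name and unit RMS values are assumptions.

```python
def noise_gain_for_snr(speech_rms, noise_rms, target_snr_db):
    """Gain for the noise so that speech_rms / (gain * noise_rms) equals the target SNR."""
    return speech_rms / (noise_rms * 10.0 ** (target_snr_db / 20.0))

for target in (0, 3, 10):
    gain = noise_gain_for_snr(speech_rms=1.0, noise_rms=1.0, target_snr_db=target)
    print(target, round(gain, 3))
# 0 1.0
# 3 0.708
# 10 0.316
```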

Speech training and testing were conducted by USMA at
their  facilities.  Training  was  performed  using  data 
collected  from  21  subjects  speaking  the  50  phrases  in  a 
quiet  environment.  The  result  of  the  training  is  a  speaker 
independent  model  for  recognition  of  the  50  test  phrases. 
Testing  was  then  performed  on  data  collected  using  14 
new subjects speaking the 50 phrases in each of the
given  noise  environments. 
The results of this experiment are shown in tables 1 and 2.
In  all  cases  the  Physiological  Sensor  and  related 
recognizer  outperformed  the  airborne  acoustic  sensor  and 
related  recognizer  for  the  given  noise  levels.  Further,  the 
%  accuracy  of  the  Physiological  Sensor  degrades  at  a 
much  lower  rate  with  increased  noise  as  compared  to  the 
airborne  acoustic  sensor. 

3.2.2  Physiological  Sensor  with  Dragon 
Naturally  Speaking 

In  order  to  evaluate  other  possible  application  areas  for 
the  Physiological  Sensor  we  decided  to  perform  a  limited 
test  with  a  commercial  ASR  product.  We  selected  Dragon 
Naturally  Speaking  for  the  test  because  we  had 
considerable  experience  using  this  product.  To  simplify 
the  experiment  we  used  the  same  set  of  phrases  as  used 
with  the  HTK  testing.  One  user  trained  the  system  using 
the  standard  user  training  session.  In  addition,  all  of  the 
words  in  the  command  phrases  were  trained  separately. 

With  this  very  limited  data  set,  50  phrases  and  one  user, 
recognition  rates  were  found  to  vary  between  about  60% 
and  80%.  While  not  outstanding,  this  is  a  fairly  good 
result  considering  that  the  ASR  engine  was  developed  for 
an  airborne  acoustic  microphone.  It  should  be  noted  that 
the  worst  recognition  rates  were  obtained  when  the  user 
removed  and  reattached  the  Physiological  Sensor.  We 
assume  that  changes  in  the  sensor  pressure  and  position 
are  the  cause  for  these  variations.  No  tests  were 
performed  in  the  presence  of  noise. 

3.2.3  Future  Research  and  Experimentation 

Experiments  with  the  Physiological  Sensor  have 
demonstrated  its  capability  to  be  used  as  a  speech  sensor 
for  specially  trained  and  configured  ASR  systems.  The 
requirement  for  special  configurations  prevents  the 
application  of  this  sensor  with  many  of  the  commercial 
ASR  products  on  the  market.  Since  the  private  sector  is 
investing  heavily  in  the  development  of  these  continually 
improving  commercial  ASR  products  it  makes  sense  to 
leverage  this  effort.  As  a  result,  ARL  will  work  to  develop 
methods  to  convert  the  output  of  the  Physiological  Sensor 
into  a  signal  that  more  closely  approximates  that  of  an 
acoustic  sensor.  If  we  can  accomplish  this  then  the 
Physiological  Sensor  should  be  suitable  for  use  with  any 
commercial  ASR  product.  The  resulting  system  would 
have  the  improved  capabilities  of  the  commercial  ASR 
products  with  the  noise  rejection  capability  of  the 
Physiological  Sensor. 

4.  Summary,  Conclusions  (Lessons  Learned), 
and  Recommendations 


Several  areas  exist  to  improve  the  operation  of  the 
Physiological  Sensor  as  a  microphone.  The  sensor  already 
has  good  airborne  noise  rejection,  but  more  can  be  done 
to  limit  the  amount  of  airborne  noise  that  couples  to  the 
sensor.  An  acoustic  insulation  material  can  be 
incorporated  around  the  shroud  of  the  sensor  that  contacts 
the  skin  to  prevent  the  airborne  noise  from  contacting  the 
sensor’s  gel  pad.  Additionally,  sensors  could  be  mounted 
on  both  sides  of  the  throat  and  their  outputs  summed 
simultaneously  so  that  the  speech  would  add 
constructively,  whereas  the  noise  would  be  reduced  by 
common  mode  rejection.  Since  the  vocal  folds  are  not 
always  symmetrical,  the  combined  left  and  right  signal 
may  improve  intelligibility  through  construction  of  an 
enhanced  signal. 
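The potential benefit of summing two sensors can be estimated with a small simulation. It assumes the speech component is identical at both sensors and the residual noise is uncorrelated between them, in which case summing should raise SNR by about 3 dB. Noise that is identical in both channels would add just as coherently as the speech and would see no gain from summing. All signals here are synthetic and the names are illustrative.

```python
import math
import random

def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def snr_db(signal, noise):
    """SNR in dB from separately known signal and noise components."""
    return 10.0 * math.log10(power(signal) / power(noise))

random.seed(0)
n = 20000
speech = [math.sin(2 * math.pi * 0.01 * k) for k in range(n)]
noise_left = [random.gauss(0.0, 1.0) for _ in range(n)]
noise_right = [random.gauss(0.0, 1.0) for _ in range(n)]

# One sensor alone versus the sum of both sensors.
single = snr_db(speech, noise_left)
summed = snr_db([2.0 * s for s in speech],
                [a + b for a, b in zip(noise_left, noise_right)])
print(round(summed - single, 2))  # close to 3 dB under these assumptions
```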

One  potential  problem  in  the  application  of  the 
Physiological  Sensor  as  an  input  to  ASR  systems  is  the 
substantial  variation  in  signal  due  to  changes  in  sensor 
pressure  and  position.  We  will  research  this  issue  in  the 
future  and  attempt  to  minimize  these  effects  in  order  to 
improve  operation  with  ASR  software. 

Circuit modifications can be made to eliminate noises from
switch  activation,  match  impedance  for  interaction  with 
other  devices,  and  adjust  the  filtering  to  create  a  more 
accurate  representation  of  the  speech.  The  preamplifier 
used  in  all  of  the  experiments  described  herein  had  a  flat 
response,  and  did  not  enhance  or  boost  the  high 
frequencies  that  are  lower  in  amplitude  than  the  very 
dominant  lower  formants.  Developing  a  non-linear 
amplifier  (filter)  can  reduce  the  “through  the  wall” 
perception  developed  by  some  listeners,  and  may  produce 
waveforms  that  better  match  what  the  commercial  ASR 
engines  expect.  In  addition,  refinement  of  ergonomics  and 
packaging  would  be  worthwhile  for  maturing  this 
technology  into  a  product. 
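One simple way to boost the weaker high frequencies is a first-order pre-emphasis filter, sketched below. This is a linear filter with a non-flat frequency response, offered as an illustration and not as the amplifier design the authors have in mind; the coefficient 0.95 is an assumed, commonly used value.

```python
def pre_emphasis(samples, alpha=0.95):
    """First-order high-frequency boost: y[n] = x[n] - alpha * x[n-1]."""
    out = []
    prev = 0.0
    for x in samples:
        out.append(x - alpha * prev)
        prev = x
    return out

# A constant (0 Hz) input is strongly attenuated after the first sample,
# while an alternating (Nyquist-rate) input is nearly doubled.
print([round(v, 2) for v in pre_emphasis([1.0, 1.0, 1.0, 1.0])])    # [1.0, 0.05, 0.05, 0.05]
print([round(v, 2) for v in pre_emphasis([1.0, -1.0, 1.0, -1.0])])  # [1.0, -1.95, 1.95, -1.95]
```

A filter matched to the sensor's measured frequency response would likely do better than this fixed first-order slope, but the sketch shows the general idea of equalizing the spectrum before recognition.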

The  physiological  sensor  has  demonstrated  exceptional 
capabilities  for  the  detection  of  voice  in  high  noise 
environments.  In  addition,  the  physiological  parameters 
detected  by  this  sensor  provide  health  and  performance 
indication,  but  might  ultimately  provide  invaluable 
emotional  or  physiological  data  that  can  be  used  to  adapt 
and  optimize  ASR  algorithms  under  diverse  situations. 
This  is  important  in  almost  every  military  and  civilian 
application.  Acoustics  can  provide  invaluable  clues  to 
help  understand  the  interrelations  between  the  soldier’s 
physiology,  the  task  at  hand,  the  spoken  word’s  intent, 
and  the  surrounding  environment. 

Areas  requiring  future  research  include  the  development 
of  a  user  independent  HMM  model  set  to  assist 
developers working with the Physiological Sensor,
development  of  algorithms  or  filters  to  enhance  operation 
of  the  sensor  for  use  with  commercial  ASR  products,  and 
refinements  in  overall  operation. 
Figure 1: Gel sensor pad. Figure 2: Neck assembly for voice.


Figure  3:  Sensor  in  helmet  headband. 
Figure 6: Boom microphone detecting voice.
A Surface Vibration Electromagnetic Speech Sensor

Jonathan  L.  Geisheimer,  Eugene  F.  Greneker,  Scott  A.  Billington,  Ittichote  Chuckpaiwong 


Abstract — As  researchers  continue  to  improve  speech  in  noisy 
environments,  more  interest  is  being  placed  on  sensors  with 
modalities  that  can  be  fused  with  traditional  acoustic  sensors. 
The  standard  literature  has  shown  that  electromagnetic  sensors 
can  be  used  to  detect  glottal  motion.  Also,  accelerometers  placed 
on  the  throat  and  nasal  areas  have  been  used  to  detect  skin 
surface  vibrations  corresponding  to  speech  and  that  data  has 
been  used  for  noise  reduction.  The  Georgia  Tech  Research 
Institute  (GTRI)  is  transitioning  a  24  GHz  radar  technology 
originally  used  for  non-contact  vital  signs  monitoring  to  a 
technology  able  to  measure  surface  motion  on  the  order  of 
microns,  which  can  detect  skin  surface  vibrations  corresponding 
to  speech.  The  radar  has  been  shown  to  measure  the  same 
motion  as  accelerometers  using  electromagnetic  waves.  This 
paper  describes  the  theory  and  preliminary  work  in  developing  a 
surface  vibration  electromagnetic  speech  sensor  to  be  used  for 
noise  reduction  in  conjunction  with  acoustic  sensors. 

Index  Terms — radar,  speech,  noisy  environments,  sensor 
fusion. 


1.  Introduction 

Every  time  a  person  speaks,  the  acoustical  pressure  waves 
from  speech  couple  through  many  parts  of  the  body, 
which  causes  structures  such  as  the  head,  neck,  chest,  and  face 
to  vibrate.  If  a  hand  is  placed  on  the  chest  or  throat  when 
speaking,  these  vibrations  can  be  readily  felt.  The  acoustic 
pressure  waves  due  to  speech  have  been  translated  to 
mechanical  vibrations.  This  has  been  confirmed  by  various 
researchers  who  have  looked  at  the  head  and  chest  vibrations 
in singers.1 Other researchers have detected mechanical
vibrations  off  of  the  neck  using  contact  accelerometers  and 
have  been  successful  in  using  the  resultant  vibration  signal  to 
cancel  noise  when  fused  with  acoustic  data.2,3 

An  electromagnetic-based  sensor  called  the  Glottal 
Electromagnetic  Micropower  Sensor  (GEMS),  developed  at 
Lawrence Livermore National Laboratory,4 has been used
to  detect  internal  body  vibrations.  This  sensor  uses  a  low 
power,  wideband  pulsed  radar  that  is  able  to  penetrate  through 
the  body  and  detect  glottal  movement.5  It  operates  at 
microwave  frequencies  less  than  3.0  GHz.  In  general,  lower 
microwave frequencies will achieve better penetration into the
body. 

The  surface  vibration  electromagnetic  speech  sensor 
concept  uses  electromagnetic  waves  in  the  millimeter  wave 
region  to  measure  the  slight  vibrations  of  the  body  on  the  skin 
corresponding  to  human  speech,  down  to  micron  levels  of 
motion.  At  the  proposed  operational  frequency  of  35.0  GHz, 
the  electromagnetic  waves  pass  through  clothes  but  do  not 
penetrate  into  the  body  as  does  the  GEMS  sensor.  The  radar 
is  detecting  only  surface  vibrations  and  therefore  directly 
measures  the  surface  skin  vibration  and  not  the  internal  body 
structures.  Since  the  device  is  directly  picking  up  speech 
vibrations,  it  will  be  referred  to  as  a  “radar  microphone”.  A 
diagram  of  the  concept  is  shown  in  Figure  1. 

Figure 1. Radar Microphone concept


Referring  to  Figure  1,  the  radar  microphone  transmits  a 
continuous  wave  (CW)  electromagnetic  signal  towards  the 
person’s  chest  or  neck  area.  Next,  the  signal  is  reflected  back 
to  the  sensor  where  it  is  demodulated  and  converted  to  a 
baseband  signal,  sampled  by  an  analog-to-digital  converter, 
and  then  run  through  digital  signal  processing  algorithms  to 
convert  the  radar  signal  into  displacement  that  correlates  with 
the  surface  vibrations  due  to  speech.  The  resultant  speech 
signal  can  then  be  fused  with  other  more  traditional  speech 
sensors  and  then  passed  on  to  an  automatic  speech  recognition 
system  if  desired. 
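
The processing chain described above can be sketched in a few lines. This is only an illustration under stated assumptions: it supposes that the receiver delivers in-phase and quadrature (I/Q) baseband samples, which the text does not specify, and it uses the standard two-way relation between carrier phase and displacement, d = wavelength * phase / (4 * pi). The function names, the 8 kHz sample rate, and the simulated 200 Hz, 5 micron vibration are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light, m/s


def simulate_iq(displacements_m, wavelength_m):
    """Ideal, noise-free I/Q baseband samples for a reflecting surface.

    A surface displaced by d lengthens the two-way path by 2*d, so the
    received carrier phase changes by 4*pi*d / wavelength.
    """
    samples = []
    for d in displacements_m:
        phase = 4.0 * math.pi * d / wavelength_m
        samples.append((math.cos(phase), math.sin(phase)))
    return samples


def iq_to_displacement(iq_samples, wavelength_m):
    """Arctangent demodulation with phase unwrapping, scaled to meters."""
    displacements = []
    previous_phase = None
    unwrap_offset = 0.0
    for i_sample, q_sample in iq_samples:
        phase = math.atan2(q_sample, i_sample)
        if previous_phase is not None:
            step = phase - previous_phase
            if step > math.pi:       # wrapped from the -pi side to the +pi side
                unwrap_offset -= 2.0 * math.pi
            elif step < -math.pi:    # wrapped from the +pi side to the -pi side
                unwrap_offset += 2.0 * math.pi
        previous_phase = phase
        displacements.append(
            (phase + unwrap_offset) * wavelength_m / (4.0 * math.pi)
        )
    return displacements


if __name__ == "__main__":
    wavelength = C / 35.0e9   # carrier frequency quoted in the text
    sample_rate = 8000.0      # hypothetical ADC rate, Hz
    vibration = [5e-6 * math.sin(2.0 * math.pi * 200.0 * n / sample_rate)
                 for n in range(80)]  # hypothetical 200 Hz, 5 micron motion
    recovered = iq_to_displacement(simulate_iq(vibration, wavelength), wavelength)
    print(max(abs(a - b) for a, b in zip(vibration, recovered)))
```

A real receiver would also need DC-offset and gain-imbalance correction before the arctangent step; those are omitted here to keep the sketch short.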


II. Technology Background

The  Georgia  Tech  Research  Institute  (GTRI)  has  been 
sensing  small-scale  biological  motion  using  radar  for  almost 
20  years,  beginning  with  the  Radar  Vital  Signs  Monitor 
(RVSM).  RVSM  technology  is  able  to  detect  both  respiration 
and  heartbeat  signatures  from  individuals  without  contact. 
The first GTRI RVSM system was developed in the mid-1980s
under sponsorship of the United States Department of
Defense (DOD); a patent on the system was issued in 1992.6
This  frequency  modulated  (FM)  radar  was  used  as  a  battlefield 
vital  signs  monitor.  The  system  was  tested  on  soldiers 
wearing  a  chemical  or  biological  warfare  suit  to  allow  vital 
signs  to  be  monitored  without  opening  the  suit  and  risking 
contamination  of  the  subject.7 

A later version of the RVSM was developed for use in the
1996 Olympics held in Atlanta, Georgia and was addressed in
a  paper  presented  by  one  of  the  authors.8  This  system  was 
built  to  monitor  the  heartbeat  of  competitors  in  the  archery  and 
rifle  events  and  was  able  to  penetrate  through  the  heavy 
leather  flak  jackets  typically  used  by  competitors.  Finally,  a 
variant  called  the  RADAR  Flashlight  was  developed  for  use 
by  law  enforcement  personnel  to  detect  the  radar  respiration 
signature  of  individuals  concealed  behind  a  wall  or  within  an 
enclosed  space  under  the  sponsorship  of  the  National  Institute 
of Justice (NIJ).9 A picture of the latest Radar Flashlight
prototype  is  shown  in  Figure  2. 


Figure  2.  Radar  Flashlight  prototype 

Recent  advances  in  the  technology  have  increased  the 
resolution  of  the  sensor  so  it  is  able  to  detect  motion  on  the 
order  of  microns.  The  associated  hardware  and  signal 
processing  advancements  have  now  enabled  the  sensor  to 
detect vibrational skin motion associated with speech
directly  off  of  the  body. 

III. Surface Vibration Speech Sensor Theory

The radar microphone is based on a phase detection
technique to achieve a sensitivity high enough to pick up
surface vibrations due to human speech. The technique does
NOT use the Doppler effect or time-of-flight measurements
common in most traditional radar designs. Instead, the key to
the GTRI technique is that the sub-wavelength phase is
measured with high accuracy, so that motion smaller than the
transmitted wavelength can be measured.

The radar microphone detects motion in a manner similar to a
laser vibrometer; however, millimeter waves are used instead
of light and a homodyne detection technique is used
instead of an interferometer. Typically, when electromagnetic
waves are used in the context of radar or other remote sensing
applications, the object of interest is moving through multiple
wavelengths. If that object is moving relative to the
transmitter, the received frequency will be different from the
transmit frequency. This is the well-known Doppler effect.
However, when an object moves less than a wavelength, as is
the case in detecting chest vibrations, a different
phenomenology, phase modulation, is at work.
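
To give a feel for the magnitudes involved, the short sketch below evaluates the standard two-way phase relation (phase = 4 * pi * displacement / wavelength) and the standard Doppler relation (shift = 2 * velocity / wavelength) at the 35 GHz operating frequency mentioned earlier, where the free-space wavelength is roughly 8.6 mm. These textbook formulas are not spelled out in the paper, and the function names and the example vibration are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light, m/s


def wavelength_m(frequency_hz):
    """Free-space wavelength of the carrier."""
    return C / frequency_hz


def phase_shift_rad(displacement_m, frequency_hz):
    """Two-way carrier phase change for a surface displaced by displacement_m."""
    return 4.0 * math.pi * displacement_m / wavelength_m(frequency_hz)


def peak_doppler_shift_hz(amplitude_m, vibration_hz, frequency_hz):
    """Peak Doppler shift 2*v/wavelength for sinusoidal surface motion."""
    peak_velocity = 2.0 * math.pi * vibration_hz * amplitude_m
    return 2.0 * peak_velocity / wavelength_m(frequency_hz)


if __name__ == "__main__":
    carrier = 35.0e9
    print("wavelength (mm):", wavelength_m(carrier) * 1e3)
    print("phase for 1 micron (rad):", phase_shift_rad(1e-6, carrier))
    # Hypothetical 200 Hz skin vibration with 5 micron amplitude.
    print("peak Doppler (Hz):", peak_doppler_shift_hz(5e-6, 200.0, carrier))
```

With micron-scale motion the surface travels only a tiny fraction of a wavelength, which is why the measurement is treated as a small phase modulation of the carrier rather than as a frequency shift.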

To  prove  the  basic  fundamentals  of  the  concept,  the 
vibration  of  the  chest  was  first  recorded  with  a  contact 
accelerometer  and  the  corresponding  acoustic  speech  was 
recorded  with  a  microphone.  The  accelerometer  was  a  high 
frequency  PCB  352C68  placed  on  the  chest  and  the 
microphone was a standard acoustical transducer. The
simultaneously recorded output from the two sensors for the
segment of speech “hickory dickory dock” is shown in Figure
3. The accelerometer data clearly shows many of the same
characteristics as the audio signal. The radar microphone will
measure the same vibrations as the accelerometer in a
non-contact manner. Past research by the authors has shown that
the signal detected by the radar correlates well with
accelerometer outputs.10


Figure 3. Simultaneous microphone and accelerometer
speech data for “hickory dickory dock” (panel title: Microphone
Output; horizontal axis: time)

IV.  Prototypes 

A  prototype  has  been  constructed  to  demonstrate  the 
technology  for  a  different  application;  however,  the  results  are 
useful  to  show  the  current  state  of  the  technology  as  well  as 
the  promise  of  the  radar  microphone.  The  resulting  hardware 
was  tested  using  a  linear  motor  with  an  optical  encoder. 

Figure  4  depicts  the  hardware  configuration  of  the  test 
setup.  A  target  was  attached  tightly  to  a  moving  portion  of  a 
linear  motor.  The  target  surface  was  covered  with  a  flat  metal 
sheet  that  is  used  as  a  reflector.  The  radar  sensor  and  the 
linear-motor  encoder  were  set  to  take  simultaneous 
measurements. The displacements from the radar sensor and
the encoder were compared so that the radar sensor
could be calibrated and its accuracy assessed.


Figure 4. Radar microphone test setup (block labels: Radar,
Target, Linear Motor, Signal Processing, DAQ, Computer)


The results are illustrated in Figure 5. The top graph is a
plot of both the radar-sensed motion and the ground truth
motion as recorded by the encoder. It can be seen that the
radar sensor was able to track the actual displacement of an
arbitrary motion. The residual on the lower graph is the
difference between the displacements measured by the radar
sensor and by the reference encoder, that is, the error of the
radar sensor. According to this graph, the radar sensor is
accurate to within ±1 mm over a displacement range of
50 mm. Looking at smaller portions of the displacement, it can
be seen that the error is often less than 0.1 mm.

Also,  the  residual  being  measured  in  this  case  is  absolute 
displacement.  Relative  displacement  errors  have  been 
measured  down  to  20  microns.  Note  that  the  residual  is  not 
randomly  distributed,  but  a  periodic  function  of  displacement. 
The  periodic  error  is  caused  by  multipath  reflections  between 
the  metal  target  and  the  metal  radar  hardware.  Sensing  of 
speech  motion  will  yield  significantly  less  multipath  and 
distortion  due  to  the  less  coherent  reflecting  surface. 
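
The residual analysis can be expressed as a short sketch. The function names and the sample displacement values are hypothetical and are not taken from the measurements behind Figure 5; the code only illustrates how a point-by-point residual and its summary statistics would be computed from two simultaneous records.

```python
import math


def residuals_mm(radar_mm, encoder_mm):
    """Point-by-point difference between radar and encoder displacement."""
    if len(radar_mm) != len(encoder_mm):
        raise ValueError("radar and encoder records must have the same length")
    return [r - e for r, e in zip(radar_mm, encoder_mm)]


def max_abs_error_mm(residuals):
    """Largest absolute residual, the figure quoted as an accuracy bound."""
    return max(abs(x) for x in residuals)


def rms_error_mm(residuals):
    """Root-mean-square residual."""
    return math.sqrt(sum(x * x for x in residuals) / len(residuals))


if __name__ == "__main__":
    # Hypothetical simultaneous samples, in millimeters.
    radar = [0.0, 10.2, 24.9, 50.3]
    encoder = [0.0, 10.0, 25.0, 50.0]
    res = residuals_mm(radar, encoder)
    print(max_abs_error_mm(res), rms_error_mm(res))
```

Because the residual described in the text is periodic in displacement rather than random, plotting it against displacement (not only against time) would be a natural next step when diagnosing multipath.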

Figure 5. Example data taken from test setup (plot title:
comparison of the displacement calculated from the radar signal
and the encoder; horizontal axis: time in seconds)

Some initial recordings have been taken using this prototype
along with simultaneous acoustic recordings. After processing
the radar signal, the presence of speech information is readily
apparent at frequencies below 500 Hz and the signal
correlates well with the acoustic data; however, the
radar-derived speech is not yet intelligible. Performance is
expected to improve both through signal processing and through
better antenna design, which will increase the frequency
response, as discussed below.

V.  Modal  Analysis 

Critical  to  the  successful  operation  of  a  radar  microphone 
is  the  “spot  size”  of  microwave  energy  illuminated  by  the 
antenna.  This  is  critical  because  the  sensor  is  measuring 
vibrations  that  are  propagating  along  the  surface  of  the  chest. 
Waves  with  peaks  and  nulls  are  moving  through  the  chest  at 
different  frequencies.  One  analogy  would  be  the  waves  that 
move  outward  in  water  when  a  stone  is  dropped  into  a  pond. 
There  are  peaks  and  nulls  in  the  water  corresponding  to  the 
propagating  surface  waves. 


The  work  of  Dr.  Kevin  Riggs  at  Stetson  University  has 
produced  holographic  images  of  vibratory  modes  in  different 
materials.  Figure  6  shows  an  example  vibratory  mode  for  a 
six  inch  square  steel  plate.  The  peaks  and  nulls  on  the  plate 
are  readily  apparent.  It  is  critical  for  accurate  measurement  of 
the  vibration  signal  that  the  illumination  area  not  detect  both 
peaks  and  nulls  at  the  same  time,  which  may  smear  the  output 
signal  in  the  frequency  domain. 

Because  the  radar  is  receiving  the  sum  of  reflections  from 
all  illuminated  points,  the  peaks  and  nulls  could  cancel  each 
other  out  and  distort  the  signal  of  interest.  Therefore,  the 
bandwidth  of  the  radar  microphone  is  limited  by  the  antenna 
spot  size  on  the  chest.  The  smaller  the  spot  size,  the  higher  the 
frequencies  that  can  be  adequately  picked  up  by  the  sensor. 
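
The cancellation argument can be illustrated with a deliberately simplified model. The sketch assumes a one-dimensional, unit-amplitude sinusoidal surface wave and a uniformly illuminated strip centered on a peak; neither assumption comes from the paper. Under those assumptions the spatially averaged displacement is sin(x)/x with x = pi * spot_width / surface_wavelength, so the response falls to zero when the spot spans one full surface wavelength.

```python
import math


def spot_average_gain(spot_width_m, surface_wavelength_m):
    """Mean of cos(2*pi*x/L) over a strip of the given width centered on a peak.

    Returns 1.0 for a point-like spot and 0.0 when the strip covers exactly
    one surface wavelength, where peaks and nulls cancel.
    """
    x = math.pi * spot_width_m / surface_wavelength_m
    if x == 0.0:
        return 1.0
    return math.sin(x) / x


if __name__ == "__main__":
    surface_wavelength = 0.10  # hypothetical 10 cm surface wave
    for spot in (0.01, 0.05, 0.10):
        print(spot, spot_average_gain(spot, surface_wavelength))
```

Higher vibration frequencies generally correspond to shorter surface wavelengths, so for a fixed spot the ratio grows with frequency and the averaged response drops, which is the bandwidth limit described above.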


Figure  6.  Example  image  of  vibratory  modes  on  a  steel 
plate  (K.  Riggs,  Stetson  University) 

As  the  standoff  distance  from  the  radar  to  the  target  of 
interest  increases,  the  area  illuminated  by  the  radar  beam 
increases,  affecting  the  frequency  sensitivity  of  the  sensor. 
The spot size in centimeters vs. distance in meters for various
antenna beamwidths (in degrees) is shown in Figure 7.

Figure 7. Spot size for given antenna beamwidths and
distances
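
A rough version of this relationship can be computed from simple geometry. The sketch assumes a conical beam whose full angular width equals the quoted beamwidth and ignores the physical antenna aperture and near-field effects; the paper does not state how the curves in Figure 7 were generated, so the function below is an illustrative approximation with a hypothetical name.

```python
import math


def spot_diameter_cm(distance_m, beamwidth_deg):
    """Diameter of the illuminated spot for a conical beam, in centimeters."""
    half_angle_rad = math.radians(beamwidth_deg) / 2.0
    return 2.0 * distance_m * math.tan(half_angle_rad) * 100.0


if __name__ == "__main__":
    for beamwidth in (5.0, 10.0, 20.0):      # degrees, hypothetical values
        for distance in (0.1, 0.5, 1.0):     # meters, hypothetical values
            print(beamwidth, distance, spot_diameter_cm(distance, beamwidth))
```

In this model the spot grows linearly with standoff distance, which is consistent with the observation that close mounting or a narrower beam is needed to keep the spot small.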

For  the  sensor  to  be  viable,  an  antenna  must  be  designed 
that  projects  a  small  spot  size  onto  the  neck,  face,  or  chest  of 
the  person.  If  the  application  is  in  traditional  military 
communications,  the  soldier  or  pilot  will  typically  be  wearing 
a headset, on which a sensor can be mounted close to the face or
neck. For larger standoffs, more exotic antennas will need to
be designed. Moving the radar to a higher transmitted
frequency will also enable smaller spot sizes, enhanced
resolution, and improved frequency response. As advances in
commercial  radar  technology  drive  prices  down  for  operating 
at  higher  frequencies  (such  as  77  GHz  for  automobile  collision 
control),  the  ability  of  the  technology  to  detect  high  resolution 
speech  will  be  improved. 

VI. Conclusion & Future Directions

The  concept  of  using  a  radar  device  as  a  surface  vibration 
electromagnetic  speech  sensor  has  been  introduced.  The  radar 
acts  as  a  sensitive  motion  detector  able  to  detect  the  surface 
vibration  of  skin  due  to  speech.  Testing  of  a  35.0  GHz  sensor 
has  shown  the  ability  to  measure  motion  down  to  microns. 
The  next  step  is  to  take  the  35.0  GHz  radar  sensor  and  record  a 
corpus  of  simultaneous  radar  and  audio  data  to  process  and 
compare.  Signal  processing  algorithms  will  be  necessary  to 
extract  speech  information  out  of  the  radar  data.  Initial 
recordings  using  the  sensor  have  shown  the  presence  of  speech 
information  at  500  Hz  and  below  in  the  radar  signal. 

Abstract

This paper addresses the testing and analysis of
various microphones against the Physiological
Microphone (provided by Pete Fisher of the
Army Research Laboratory) in different working
conditions [1,2]. We explore different
techniques and environments in which a user
interfaces with a selected ASR program. Testing
multiple microphones gave results that varied
with the environment. The software of
choice for our research was Dragon Naturally
Speaking 5.0.

1.  Introduction 

Automatic  Speech  Recognition  systems  enable 
users  to  operate  their  computer  through  the  use 
of  their  voice.  This  advancement  has  benefited 
casual  consumers,  professionals  and  handicapped 
individuals alike. The development of a
microphone that allows the user to move about
freely and that rejects background noise has
become necessary for practical use by both
professionals and consumers. Although
significant progress has been made in ASR, there
are still limitations that must be taken into
consideration. The technology on the
market for consumers today operates efficiently
only under controlled conditions and through
dictation, not conversation.

Factors  to  be  considered  in  recognition  accuracy: 

•  Environment  (background  noise, 
room  size) 

•  Computer  Hardware  (CPU  speed, 
RAM,  soundcard) 

•  Amount  of  training  with  software 

•  Position  of  microphone 

•  Speaking  style  and  clarity 

•  Microphone type

•  Variability in the consumer’s speech
(e.g., stress, colds)

These  factors  are  considered  to  determine  the 
most  effective  speech  recognition  procedure  for 
each  microphone  based  on  environment. 

2.  System  Descriptions 

Our results were recorded on four test machines,
all of which used Intel-based processors.

System A

•  Pentium III 0.5 GHz

•  256 MB PC133 RAM

•  Yamaha DS-XG Sound Card

System B

•  Pentium IV 1.4 GHz

•  256 MB RDRAM

•  Sound Blaster Live! 5.1

System C

•  Pentium IV 1.4 GHz

•  256 MB RDRAM

•  Sound Blaster Live! 5.1

System D

•  Pentium IV 1.8 GHz

•  256 MB RDRAM

•  Sound Blaster Live! 5.1

The  testing  phase  of  the  research  continued 
through  the  use  of  four  styles  of  microphones. 

Microphone types:

•  Telex H-551 Headset Microphone (Reference Mic.) (System B)
   -  USB digital stereo headset

•  Physiological Microphone (P-Mic) (System A)
   -  Throat microphone that detects vibration through skin and bone

•  Telex M-60 (System C)
   -  Super-directional linear array microphone

•  Telex M-40 (System D)
   -  Standard desktop microphone

Our  findings  were  based  on  the  aforementioned 
hardware  combined  with  a  predetermined 
method  of  testing.  All  computers  exceeded  the 
hardware  requirements  of  Dragon  Naturally 
Speaking  v5.0.  Through  preliminary  testing,  we 
found  all  recognizer  engines  operated  at  the  same 
speed  when  dictating.  Therefore,  microphones 
were  arbitrarily  assigned  to  each  computer. 

3.  P-Mic  Description 

The Physiological Microphone is optimized for
hands-free use and is designed to eliminate most
background noise. It has its own power source,
a 7.5-volt silver-oxide battery. The P-Mic has a
power switch that allows the user to pause
dictation without having to remove the
microphone or stop the program. By comparison,
two of the other microphones we used were a
stationary desktop microphone (Telex M-40) and
a super-directional linear array microphone
(Telex M-60); the Telex M-40 lacks a power
switch, which is inconvenient in ASR.
Physically, the P-Mic does not resemble a
typical microphone. It is worn like a collar and
has a silicon contact sensor that is placed
slightly to the left or right of the throat; either
side works because the throat is symmetrical.
The P-Mic is small and lightweight: the width of
the collar and the diameter of the sensor are
each about 1 inch. With the P-Mic the user
can move about freely and have both hands
available. Traditional microphones used in ASR
require that the user remain stationary, thus
limiting productivity in the workplace. The
P-Mic plugs into the "Line-In" jack on the sound
card via a phono plug, whereas traditional
microphones use the microphone jack.


4.  Procedure  for  Microphone  Testing 

Testing  was  performed  in  a  typical,  quiet 
research  laboratory  environment.  Our  research 
lab’s  dimensions  are  22’  x  17'.  The  room  is  prone 
to  little  outside  noise  interference.  A  radio 
playing  a  recorded  talk  radio  conversation  at 
variable  volumes  was  used  to  produce 
background  noise.  The  recorded  talk  radio  show 
was  selected  for  consistency,  allowing  each 
microphone  to  be  subject  to  the  same 
interference.  The  simulated  conversation  source 
was  emitted  10'  behind  the  speaker. 

Before  testing  we  positioned  four  computers 
such  that  they  could  be  tested  simultaneously  by 
one  user.  Each  of  the  four  microphones  was 
assigned  arbitrarily  to  a  computer.  We  then 
performed  the  basic  training  required  according 
to  the  Dragon  Naturally  Speaking 
documentation.  Next  a  400-word  passage  was 
dictated  once  while  correcting  and  training  all 
errors  that  occurred.  The  400-word  passage 
contained  general  vocabulary.  After  training,  the 
Telex  M-40  and  Telex  M-60  were  attached  to  a 
microphone  stand  and  positioned  directly  in  front 
of the speaker. The user then attached the H-551
and the P-Mic, enabling all four microphones to
be tested at the same time. The speaker tested
each microphone with background noise set at
no additional noise, 60 dB, 70 dB, and 80 dB,
respectively. The environment where we tested
had  an  average  of  50  dB  of  background  noise. 
The  quiet  conditions  were  to  facilitate  the  peak 
performance  of  each  of  the  four  microphones. 

The  speaker  then  started  Dragon  Naturally 
Speaking  on  all  four  computers.  The  speaker 
read  the  passage  speaking  at  an  average  volume 
of  80dB.  With  the  speaker  speaking  at  80  dB 
and  noise  at  50  dB,  the  difference  of  30  dB 
provides  an  ideal  speech-to-noise  ratio  for  ASR. 
The speaker’s volume was fixed to keep him
from giving in to the urge to compete with the
added background noise, especially at the highest
level of noise (80 dB). This allowed the
experiment  to  be  performed  at  speech-to-noise 
ratios  varying  from  excellent  to  very  poor  for 
speech  recognition  purposes.  Each  test  was 
performed  three  times  per  sound  level  and  the 
results  were  averaged.  The  dictated  passages 
were  printed  and  saved  for  analysis  of  mistakes 
made  during  dictation. 
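
The speech-to-noise ratios implied by this procedure follow from simple subtraction of levels in decibels. The sketch below uses the 80 dB speech level and the four noise conditions from the text; it treats the stated noise level as the total noise in each condition, which is a simplifying assumption (the 50 dB ambient level would add slightly to the 60 dB condition). The names are hypothetical.

```python
SPEECH_LEVEL_DB = 80.0

# Noise level per test condition, in dB, as described in the procedure.
NOISE_LEVELS_DB = {
    "no added noise": 50.0,
    "60 dB": 60.0,
    "70 dB": 70.0,
    "80 dB": 80.0,
}


def speech_to_noise_db(speech_db, noise_db):
    """Speech-to-noise ratio as a difference of levels in decibels."""
    return speech_db - noise_db


RATIOS_DB = {
    label: speech_to_noise_db(SPEECH_LEVEL_DB, noise)
    for label, noise in NOISE_LEVELS_DB.items()
}

if __name__ == "__main__":
    for label, ratio in RATIOS_DB.items():
        print(label, ratio)
```

The four conditions therefore span ratios from 30 dB down to 0 dB, matching the description of conditions ranging from excellent to very poor.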

5. Results

The  results  for  the  four  microphones  tested  are 
documented  in  the  plot  below.  Results  per 
microphone  in  each  environment  are  the  average 
of  three  test  sessions,  recording  the  accuracy 
rate. The equations we used were
Percent Error = (Errors / Total Words) * 100 and
Accuracy Rate = 100 - Percent Error. Each
capitalization  error,  period,  paragraph 
indentation,  etc.  was  counted  as  an  error,  and  a 
wrong  word  or  a  skipped  word  was  counted  as 
one  error.  Therefore,  Type  I  and  Type  II  errors 
were  counted  as  one  error.  Multiple  word 
phrases  recorded  in  error  in  the  place  of  one 
word  were  counted  as  one  error  (example:  user 
says,  "comma"  and  program  records,  "come  on", 
=  one  error). 
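
The scoring rule can be written directly as code. The sketch below implements the two equations given in the text and the averaging over three sessions; the function names and the error counts in the example are hypothetical, chosen only to show the arithmetic for a 400-word passage.

```python
def accuracy_rate(errors, total_words):
    """Accuracy in percent: 100 - (errors / total_words) * 100."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    percent_error = errors / total_words * 100.0
    return 100.0 - percent_error


def mean_accuracy(error_counts, total_words):
    """Average accuracy over several sessions of the same passage."""
    rates = [accuracy_rate(e, total_words) for e in error_counts]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical error counts from three sessions of a 400-word passage.
    print(mean_accuracy([3, 4, 5], 400))
```

Because every error type counts once, a single skipped word and a single mis-capitalization lower the score by the same amount.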

Table  1  contains  the  results  for  the  microphones 
tested  at  each  level  of  background  noise.  The 
last  column  depicts  the  total  percentage  change 
from  quiet  conditions  to  80dB  background  noise. 


Table 1. (Performance in %)

Mic. Type   No Noise   60dB     70dB     80dB     Total Chg.
H551        99.0       98.5     96.5     89.75    9.25%
M-60        98.75      97.25    92.5     85.25    13.5%
M-40        95.5       94.25    87.5     81.75    13.75%
P-Mic       97.5       96.0     93.75    92.0     5.5%
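
The last column of Table 1 is the difference between the quiet-condition accuracy and the accuracy at 80 dB. A minimal sketch of that calculation, using the values from the table (the variable and function names are hypothetical):

```python
# Accuracy in percent at: no added noise, 60 dB, 70 dB, 80 dB (from Table 1).
TABLE_1 = {
    "H551": [99.0, 98.5, 96.5, 89.75],
    "M-60": [98.75, 97.25, 92.5, 85.25],
    "M-40": [95.5, 94.25, 87.5, 81.75],
    "P-Mic": [97.5, 96.0, 93.75, 92.0],
}


def total_change(scores):
    """Drop in accuracy from the first (quiet) to the last (80 dB) condition."""
    return scores[0] - scores[-1]


TOTAL_CHANGE = {mic: total_change(scores) for mic, scores in TABLE_1.items()}

if __name__ == "__main__":
    for mic, change in TOTAL_CHANGE.items():
        print(mic, change)
```

Sorting the microphones by this value ranks them by overall sensitivity to noise, with the P-Mic showing the smallest drop.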

The  graph  below  illustrates  that  microphone 
performance  was  above  94%  accuracy  when 
speech-to-noise  ratios  were  ideal.  Notice  that  the 
steepest  drop  for  the  acoustic  microphones 
occurred  between  70  and  80  dB,  whereas  the 
slope  of  the  P-Mic  continues  along  a  fairly 
straight  line.  The  P-Mic  never  dropped  more 
than  3%  between  increased  levels  of  background 
noise. 


Figure 1. Combined results (plot title: Microphone Results)

Table 2 breaks down the percent change at each
increase in background noise. The acoustic
microphones’ performance all dropped in
parallel as the levels of background noise were
increased. The P-Mic’s performance, on the
other hand, did not decrease by a larger
percentage as more background noise was added
(specifically, from 60 to 70 dB versus 70 to
80 dB).


Table 2. (Percent Change)

Mic. Type   No Noise to 60dB   60 to 70dB   70 to 80dB
H551        0.5%               2.0%         5.25%
M-60        1.5%               4.75%        7.25%
M-40        1.25%              2.75%        5.75%
P-Mic       1.5%               2.25%        1.75%
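
The step-by-step changes can also be derived from the Table 1 accuracies by differencing adjacent noise conditions. The sketch below does that arithmetic with hypothetical names. For the P-Mic and M-60 rows it reproduces Table 2; for the H551 (70 to 80 dB) and M-40 (60 to 70 dB) cells the differences computed from Table 1 come out to 6.75 rather than the printed values, so one of the two tables may contain a transcription error that cannot be resolved from the text alone.

```python
# Accuracy in percent at: no added noise, 60 dB, 70 dB, 80 dB (from Table 1).
TABLE_1 = {
    "H551": [99.0, 98.5, 96.5, 89.75],
    "M-60": [98.75, 97.25, 92.5, 85.25],
    "M-40": [95.5, 94.25, 87.5, 81.75],
    "P-Mic": [97.5, 96.0, 93.75, 92.0],
}


def step_drops(scores):
    """Accuracy lost between each pair of adjacent noise conditions."""
    return [before - after for before, after in zip(scores, scores[1:])]


if __name__ == "__main__":
    for mic, scores in TABLE_1.items():
        print(mic, step_drops(scores))
```

The P-Mic row shows the property emphasized in the text: no single step costs it more than 3 percentage points.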

6. Conclusions


We conclude that the Physiological Microphone
outperformed its competition by the widest margin
at the most stressful speech-to-noise ratios.
The Physiological Microphone’s performance
was relatively unhampered by very poor
speech-to-noise ratios. Our acoustic microphones’
largest drop in recognition accuracy occurred at
80 dB. The acoustic microphones dropped at
least 5% at this level, whereas the P-Mic dropped
only 1.75%. The P-Mic’s total percent change of
errors was about half that of the reference
microphone. Although the P-Mic performed
above the rest, the 99% accuracy at quiet
conditions still eluded it. Our data leads us to
believe that the P-Mic has great potential when
used in high background noise areas. We feel
that the addition of an acoustic sensor used in
tandem with the Physiological Microphone will
boost recognition accuracy.

7. Future Endeavors

In  the  near  future,  we  plan  on  acquiring  a  more 
accurate  sound  level  meter,  with  a  low  range  of 
30dB.  We  would  also  like  to  acquire  an 
electronic  mouth  to  aid  in  our  normalization 
process.  Plans  to  create  and  implement  a 
throat/neck  simulator  are  also  being  arranged. 
This simulator, used with the electronic mouth,
will allow for a minimum of user errors and a
near complete normalization of the test
environment when using a pre-recorded file. We
are also interested in acquiring other throat
sensors and testing their performance versus the
Physiological Microphone.

ABSTRACT

In this work we consider the bimodal fusion problem in
audio-visual speech recognition. A novel sensory fusion architecture
based on the coupled hidden Markov models (CHMMs) is
presented. CHMMs are directed graphical models of stochastic
processes and are a special type of dynamic Bayesian networks.
The proposed fusion architecture allows us to address the
statistical modeling and the fusion of audio-visual speech in a unified
framework. Furthermore, the architecture is capable of capturing
the asynchronous and temporal inter-modal dependencies
between the two information channels. We describe a model
transformation strategy to facilitate inference and learning in
CHMMs. Results from audio-visual speech recognition
experiments confirmed the superior capability of the proposed fusion
architecture.

1.  INTRODUCTION 

Incorporating visual information into automatic speech
recognition (ASR) has been demonstrated as an effective approach to
improve the performance and robustness over the audio-only
systems, and has received much attention in recent years [7].
One of the most challenging issues in bimodal ASR is how to
fuse the audio (i.e. acoustic speech signal) and the visual (i.e. lip
motion) modalities.

The  fusion  of  audio  and  visual  speech  is  an  instance  of  the 
general  sensory  fusion  problem.  The  sensory  fusion  problem 
arises in the situation when multiple channels carry complementary
information about different components of a system. In the
case  of  audio-visual  speech,  the  two  modalities  manifest  two 
aspects  of  the  same  underlying  speech  production  process.  From 
an  observer’s  view,  the  audio  channel  and  the  visual  channel 
represent  two  interacting  stochastic  processes.  We  seek  a 
framework  that  can  model  the  two  individual  processes  as  well 
as  their  dynamic  interactions. 

One interesting aspect of audio-visual speech is the inherent
asynchrony between the audio and visual channels. Most early
integration approaches to the fusion problem assume tight
synchrony between the two. However, studies have shown that
human perception of bimodal speech does not require rigid
synchronization of the two modalities [6]. Furthermore, humans
appear to use the audio-visual asynchronies as multimodal
features. For example, it is well known that the voice onset time
(VOT) is an important cue to the voicing feature in stop
consonants. This information can be conveyed bimodally by the
interval between seeing the stop release and hearing the vocal cord
vibration. Therefore, a successful fusion scheme should not only
be tolerant to asynchrony between the audio and visual cues, but
also be apt to capture and exploit this bimodal feature.

2.  SENSORY  FUSION  USING  CHMMS 

Modeling stochastic processes that have structure in time is a
fundamental problem. A number of frameworks have been
proposed to formulate problems of this kind. Among them is the
hidden Markov model (HMM), which has found great success in
the field of ASR. In recent years, a more general framework, the
Dynamic Bayesian Network (DBN), has emerged as a powerful
and flexible tool to model complex stochastic processes [3].


Figure  1.  DBN  representation  of  an  HMM 


DBNs generalize hidden Markov models by representing
the hidden states as state variables, and allow the states to have
complex interdependencies. Under the DBN framework, the
conventional HMM is just a special case with only one state
variable in a time slice. DBNs are commonly depicted
graphically in the form of probabilistic inference graphs. An HMM
can be represented in this form by rolling out the state machine
in time, as shown in Figure 1. Under this representation, each
vertical slice represents a time step. The circular node in each
slice is the multinomial state variable, and the square node in
each slice represents the observation variable. The directed links
signify conditional dependence between nodes.

It is possible to use just an HMM to carry out the modeling
and fusion of multiple information sources. This can be
accomplished by attaching multiple observation variables to the state
variable, with each observation variable corresponding to one of the
information sources. Figure 2 illustrates the fusion of audio and
visual information using this scheme. Because both channels
share the single state variable, this approach in effect assumes
the two information sources always evolve in lockstep.
Therefore, it is not able to model asynchronies between the two
channels.

Figure 2. Audio-visual fusion using HMM (time slices t = 1,
t = 2, t = 3, t = T)
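
The single-state fusion scheme can be sketched with the standard HMM forward recursion, in which the emission term at each time step is the product of the audio and visual observation likelihoods. The sketch assumes that the two streams are conditionally independent given the shared state and that per-state likelihoods have already been evaluated; the function name and all numbers are hypothetical and do not come from the paper's experiments.

```python
def forward_likelihood(initial, transition, audio_lik, visual_lik):
    """Total observation likelihood for an HMM whose state emits both streams.

    initial[s]        : P(state s at t = 1)
    transition[r][s]  : P(state s at t | state r at t - 1)
    audio_lik[t][s]   : p(audio observation at t | state s)
    visual_lik[t][s]  : p(visual observation at t | state s)
    """
    n_states = len(initial)
    # Both streams hang off the same state, so their likelihoods multiply.
    alpha = [initial[s] * audio_lik[0][s] * visual_lik[0][s]
             for s in range(n_states)]
    for t in range(1, len(audio_lik)):
        alpha = [
            sum(alpha[r] * transition[r][s] for r in range(n_states))
            * audio_lik[t][s] * visual_lik[t][s]
            for s in range(n_states)
        ]
    return sum(alpha)


if __name__ == "__main__":
    initial = [0.6, 0.4]
    transition = [[0.9, 0.1], [0.2, 0.8]]
    audio = [[0.5, 0.1], [0.4, 0.3]]    # hypothetical likelihoods
    visual = [[0.2, 0.3], [0.6, 0.1]]   # hypothetical likelihoods
    print(forward_likelihood(initial, transition, audio, visual))
```

Because one state index drives both emission terms at every time step, the audio and visual streams cannot occupy different states, which is the lockstep limitation noted above.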

An interesting instance of the DBNs is the so-called
coupled hidden Markov model (CHMM). The name CHMM
comes from the fact that these networks can be viewed as
parallel rolled-out HMM chains coupled through cross-time and
cross-chain conditional probabilities. In the perspective of
DBNs, an n-chain CHMM has n hidden nodes in a time slice,
each connected to itself and its nearest neighbors in the next time
slice. For the purpose of audio-visual speech modeling, we
considered the case of n = 2, or the 2-chain CHMMs. Figure 3 shows
the inference graph of a 2-chain CHMM.


Figure 3. Audio-visual fusion using CHMM (rows: acoustic
channel, visual channel; time slices t = 1, t = 2, t = 3)


There are two state variables in the graph. The state of the
system at a given time slice is jointly determined by the states of
these two multinomial variables. More importantly, the state of
each state variable depends on both of its parents in the
previous time slice. This configuration essentially permits
unsynchronized progression of the two chains, while encouraging
the two sub-processes to assert temporal influence on each
other’s states. Note that the Markov property is not jettisoned by
introducing the additional state variable and the directed links.
Given the current state of the system, the future is conditionally
independent of the past. Furthermore, given its two parents, a
state variable is also conditionally independent of the other state
variable.

In addition to the two state variables, there are two
observation variables in each time slice. Each observation variable is a
private child of one of the state variables. The observation
variables can be either discrete or continuous. It is possible with this
framework that one of the state variables is continuous and the
other one is discrete.

In  the  context  of  audio-visual  speech  fusion,  the  audio  and 
visual  channels  are  associated  with  the  two  state  variables  re¬ 
spectively  through  the  observable  nodes.  Inter-channel  asyn¬ 
chrony  is  allowed.  The  overall  dynamics  of  the  audio-visual 
speech  is  determined  by  both  modalities. 

In  general,  the  time  complexity  of  exact  inference  in  DBNs 
is  exponential  in  the  number  of  state  variables  per  time  slice. 
For systems with a large number of state variables, exact inference
quickly becomes computationally intractable. Consequently,
much attention in the literature has been paid to approximation
methods that aim to solve the general problem. Existing
approaches include the variational methods [4] and the sampling
methods [5]. However, these methods usually exhibit nice
computational properties only in an asymptotic sense. When the number
of  states  is  very  small,  the  computational  overhead  embedded  in 
the  approximation  method  is  often  large  enough  to  offset  the 
theoretical  reduction  in  time  complexity.  In  this  situation,  the 
approximation  becomes  superfluous  and  exact  inference  be¬ 
comes  more  desirable.  In  the  following  section,  we  describe  a 
model  transformation  strategy  that  facilitates  inference  and  learn¬ 
ing  in  CHMMs. 

3.  CHMM  TRANSFORMATION 

The state of a 2-chain CHMM is jointly determined by the two
state variables in the parallel chains. If the two state variables
can take $Q_1$ and $Q_2$ discrete values respectively, then the
CHMM in effect has $Q_1 \times Q_2$ possible states. The same state
space can also be represented by a conventional HMM that has
$Q_1 \times Q_2$ hidden states. Moreover, in a CHMM, the output
distribution of a joint state can be obtained by taking the product of
the two output densities of the two individual state variables.
Similarly, in a 2-stream HMM, the output distribution of a state
is the product of the two stream-dependent densities. Hence, it
is also possible to represent the output configurations of a
2-chain CHMM with a 2-stream HMM that has an equivalent state
space. However, the observable nodes of a $Q_1 \times Q_2$ CHMM are
fully specified by a table containing $Q_1 + Q_2$ entries. On the
other hand, an unconstrained 2-stream HMM with $Q_1 \times Q_2$
hidden states has $2 \times Q_1 \times Q_2$ distinct output densities.
This difference arises because in the CHMM an output node is only
dependent on its single parent, while in the state-equivalent HMM
the output is effectively conditioned on both state variables in
the original CHMM. Fortunately, this discrepancy can be
readily resolved by tying the appropriate output densities in the
2-stream HMM according to the mapping from CHMM states to
HMM states. This state mapping and parameter tying procedure
is easy to visualize graphically.

Figure 4 illustrates the state-machine diagram of the 2-stream
HMM obtained by transforming a 2-chain CHMM with $Q_1 = 3$
and $Q_2 = 2$. The state space of the original CHMM is
represented by the 6 hidden states in the HMM. This mapping is
explicitly depicted in the diagram. For example, state 3 in the HMM
is equivalent to the state $\{q_1 = 2, q_2 = 1\}$ in the CHMM. The
output densities of the HMM are tied according to the mapping.
In Figure 4, the observation nodes with the same color
shade are tied. For example, the output densities modeling the
lower stream in states 2, 4, and 6 are tied, because they all
correspond to the entry $p(o_2 \mid q_2 = 2)$ in the CPT of the CHMM.

Figure 4. Transform CHMM to HMM through state-space
mapping and parameter tying
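The state mapping and tying described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the index ordering (the second chain varying fastest) is inferred from the two examples in the text, namely that HMM state 3 corresponds to $\{q_1 = 2, q_2 = 1\}$ and that states 2, 4, and 6 share $q_2 = 2$.

```python
from itertools import product


def chmm_to_hmm_states(q1_count, q2_count):
    """Map 1-based HMM state indices to joint CHMM states (q1, q2)."""
    # Assumed ordering: q2 varies fastest, so (q1, q2) -> (q1 - 1) * Q2 + q2.
    return {
        (q1 - 1) * q2_count + q2: (q1, q2)
        for q1, q2 in product(range(1, q1_count + 1), range(1, q2_count + 1))
    }


def tied_groups(state_map, chain):
    """Group HMM states whose stream for `chain` (0 or 1) shares one density."""
    groups = {}
    for hmm_state, joint in sorted(state_map.items()):
        groups.setdefault(joint[chain], []).append(hmm_state)
    return groups


state_map = chmm_to_hmm_states(3, 2)
upper = tied_groups(state_map, 0)   # densities indexed by q1
lower = tied_groups(state_map, 1)   # densities indexed by q2
print(state_map[3])                 # joint CHMM state behind HMM state 3
print(lower[2])                     # HMM states tied to p(o2 | q2 = 2)
print(len(upper) + len(lower))      # distinct densities after tying: Q1 + Q2
```

After tying, the 6-state HMM keeps only $Q_1 + Q_2 = 5$ distinct stream densities instead of the $2 \times Q_1 \times Q_2 = 12$ of an unconstrained 2-stream HMM.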

The  allowed  state  transition  in  the  HMM  is  also  derived 
from  the  state  space  mapping.  In  this  example,  it  is  assumed  that 
the  conditional  probabilities  concerning  the  two  state  variables 
in  the  CHMM  satisfy  the  following  condition. 

$$P(q^i_{t+1} = k \mid q^1_t, q^2_t) = 0 \quad \text{for } k \notin \{q^i_t,\ q^i_t + 1\}, \quad i = 1, 2 \qquad (1)$$

This condition essentially enforces the left-to-right no-skip
policy in the sense of conventional HMMs for the two state variables
in the CHMM, which is commonly used in audio-only speech
recognizers. For example, a possible state path in the CHMM
could be $\{q_1 = 1, q_2 = 1\} \to \{q_1 = 2, q_2 = 1\} \to \{q_1 = 3, q_2 = 2\}$;
this is equivalent to the allowed state path 1 → 3 → 6 in the
HMM.
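As a rough illustration of how the allowed HMM transitions follow from the state-space mapping, the sketch below enumerates them for the $Q_1 = 3$, $Q_2 = 2$ example. It assumes that the left-to-right no-skip policy means each chain either stays in its state or advances by exactly one, and it reuses the index ordering inferred from the text; the function names are hypothetical.

```python
def hmm_index(q1, q2, q2_count):
    """1-based HMM state index for the joint CHMM state (q1, q2)."""
    return (q1 - 1) * q2_count + q2


def allowed_transitions(q1_count, q2_count):
    """HMM transitions allowed when each chain stays or advances by one."""
    allowed = set()
    for q1 in range(1, q1_count + 1):
        for q2 in range(1, q2_count + 1):
            for n1 in (q1, q1 + 1):
                for n2 in (q2, q2 + 1):
                    if n1 <= q1_count and n2 <= q2_count:
                        src = hmm_index(q1, q2, q2_count)
                        dst = hmm_index(n1, n2, q2_count)
                        allowed.add((src, dst))
    return allowed


def path_is_allowed(path, allowed):
    """True if every consecutive pair of states is an allowed transition."""
    return all(step in allowed for step in zip(path, path[1:]))


allowed = allowed_transitions(3, 2)
print(path_is_allowed([1, 3, 6], allowed))  # the path from the text
print(path_is_allowed([1, 5], allowed))     # q1 jumps from 1 to 3: a skip
```

Relaxing the inner loops for one chain (for example, letting `n2` range over all visual states) would give the ergodic variant mentioned in the next paragraph.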

Other  meaningful  model  configurations  can  be  obtained 
through  manipulating  the  allowed  state  transitions.  For  instance, 
it  might  be  reasonable  to  model  the  dynamics  of  the  lip  motion 
using  an  ergodic  state  variable,  i.e.,  no  restriction  on  the  possible 
state  transitions  for  this  variable. 

It is worth noting that the 2-stream HMM approach to
audio-visual fusion shown in Figure 2 can be considered a
special case of the CHMM-based fusion architecture. In that
case, the number of audio states must be equal to the number
of visual states, and the two state variables always progress in
lockstep, i.e. $Q_1 = Q_2$ and $q^1_t = q^2_t$ for all $t$. The CHMM-based
fusion architecture permits a much richer space for modeling
interactions between the two modalities.

The model transformation strategy described is fairly
general and can be implemented on any HMM-based ASR platform
that supports multiple observation streams and parameter tying.


The experiments have two objectives. The first is to evaluate
the improvement in noise robustness brought by the bimodal
approach to ASR. The second is to compare the performance of
the proposed fusion architecture with other fusion techniques.

To fulfill the first objective, we built an acoustic speech
recognizer as the baseline system. The recognizer was trained
using clean speech. Noisy conditions at a particular SNR level
were simulated by adding white Gaussian noise to the clean
speech samples. The same acoustic feature sets were also used
in the audio channel of the bimodal system. However, it is
assumed that the visual channel is not affected by any additional noise
during testing. A visual-only recognizer was built and used as a
benchmark. To achieve the second objective, we implemented a
common form of the early integration approach, i.e. fusion by
concatenating the audio and visual feature vectors. The systems
were developed using HTK.
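Simulating a given SNR by adding white Gaussian noise amounts to scaling the noise variance against the measured signal power. The following is a minimal sketch under that assumption, not the authors' actual tooling; the function names, the 440 Hz test tone, and the 16 kHz sampling rate are hypothetical.

```python
import math
import random


def add_white_noise(samples, snr_db, rng=None):
    """Return samples plus white Gaussian noise at the requested SNR (dB)."""
    rng = rng or random.Random(0)
    signal_power = sum(x * x for x in samples) / len(samples)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  solve for P_noise.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in samples]


def measured_snr_db(clean, noisy):
    """Empirical SNR of a noisy signal relative to its clean version."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


# One second of a synthetic tone stands in for a clean speech sample.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy = add_white_noise(clean, snr_db=10.0)
print(round(measured_snr_db(clean, noisy), 1))
```

Because the noise is random, the measured SNR only approximates the target; the deviation shrinks as the signal gets longer.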

Evaluation of the bimodal speech recognition system was
performed on an audio-visual speech dataset [1] collected by
Chen et al. at Carnegie Mellon University. The vocabulary
consists of 78 words commonly used in scheduling applications.
The visual features were derived from the lip-tracking data
provided with the bimodal speech dataset. The primary visual
features considered in the experiments are composed of $h_1$ and $h_2$,
which measure the vertical openings of the upper and lower lips,
and the distance between the two mouth corners, $w$. Delta
features were also included, thus the actual visual feature vector is
six-dimensional. The acoustic speech was processed using a
25 ms Hamming window, with the frame period set at 10 ms. For
each frame, 12 MFCC coefficients were calculated from the
result of filterbank analysis using 26 channels. Delta coefficients
were also computed and then appended to the static features,
resulting in a 24-dimensional acoustic feature vector.
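To make the feature dimensions concrete, the sketch below appends delta coefficients to a sequence of static feature frames. It assumes the regression-style delta formula commonly used in HMM toolkits, with a window of two frames and edge frames replicated; the paper does not state the delta window, so that value and the function names are assumptions.

```python
def delta(frames, window=2):
    """Regression-based delta coefficients over +/- `window` frames."""
    denom = 2 * sum(k * k for k in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, last)][d]   # replicate last frame
                behind = frames[max(t - k, 0)][d]     # replicate first frame
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def append_deltas(frames, window=2):
    """Concatenate each static frame with its delta coefficients."""
    return [s + d for s, d in zip(frames, delta(frames, window))]


# Five dummy frames of 12 static coefficients, each rising by 1 per frame.
static = [[float(t)] * 12 for t in range(5)]
features = append_deltas(static)
print(len(features[0]))     # 12 static + 12 delta coefficients
print(features[2][12])      # slope of a unit ramp at the middle frame
```

The same routine applied to the three static visual features ($h_1$, $h_2$, $w$) yields the six-dimensional visual vector.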

We constructed the acoustic and the audio-visual speech
models at the word level. The audio-only system is based on
HMMs with nine states, left-to-right topology, and no skips.
The HMMs used in the visual-only system have a similar
topology, but with only five states. An HMM configuration identical to
that of the audio-only system is used in the early integration bimodal
system. The CHMM-based bimodal system uses five states to
model the audio channel and three states for the visual channel.
The allowed state transitions follow the policy specified in
equation (1). Recognition was performed in the connected-word
mode without the help of any grammatical constraints. A
cross-validation scheme was used in the evaluations due to the limited
amount of data. Specifically, the recognizers were trained on a
subset containing 90% of the available data and tested on the
remaining 10%; this process was repeated until all data had been
covered in testing. The results are summarized in Table 1.
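The 90%/10% rotation described above corresponds to a ten-fold cross-validation. A minimal sketch of the data partitioning follows; how the authors actually assigned utterances to folds is not stated, so the interleaved assignment and the function name here are assumptions.

```python
def ten_fold_splits(items, folds=10):
    """Yield (train, test) pairs; each item lands in exactly one test set."""
    for k in range(folds):
        test = items[k::folds]                       # every 10th item
        train = [x for i, x in enumerate(items) if i % folds != k]
        yield train, test


utterances = list(range(50))     # placeholder utterance identifiers
tested = []
for train, test in ten_fold_splits(utterances):
    assert not set(train) & set(test)    # no overlap within a fold
    tested.extend(test)
print(sorted(tested) == utterances)      # all data covered in testing
```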

In the recognition results, it is evident that both of the
bimodal systems demonstrate improved noise robustness in
comparison to the audio-only system. However, at 10 dB, the gain in
robustness achieved by the early integration system is very
limited. On the other hand, the CHMM approach managed to give a
clear improvement in performance at the same SNR level. At
30 dB, which is the SNR of the clean speech data, the recognition
accuracy of the CHMM-based system is slightly worse than that of both
the audio-only recognizer and the early integration bimodal
system.

Figure 5. Forced alignment using audio only HMM and
audio-visual CHMM

Table 1. Summary of recognition results (measured in % word
accuracy). ‘A’ indicates the audio-only system; ‘V’ indicates the
visual-only system; ‘A+V’ indicates the bimodal system using
early integration; and ‘CHMM’ indicates the CHMM-based
system.

SNR     10 dB   20 dB   30 dB
A        4.03   43.61   99.10
V       42.95   42.95   42.95
A+V     10.58   72.79   99.74
CHMM    35.32   86.58   93.32


An  important  cue  the  visual  modality  provides  in  bimodal 
speech  perception  is  the  information  about  boundary  locations  of 
the  speech  units  within  an  utterance.  It  would  be  interesting  to 
see  if  this  effect  can  be  observed  in  our  audio-visual  ASR  sys¬ 
tem.  We  computed  forced  alignment  of  a  speech  segment  in  the 
20  dB  test  set  using  both  the  acoustic  only  recognizer  and  the 
CHMM-based  bimodal  recognizer.  The  results  are  illustrated  in 
Figure  5. 

Figure 5 covers a 10-second segment of the alignment
result. The two subplots on the bottom show the word boundaries
superimposed on the speech waveform. The upper one is the
alignment obtained using audio-visual CHMMs; the lower one
shows the alignment obtained using acoustic-only HMMs. The
three subplots on the top display the static visual features used in
the bimodal system. All five plots are time-aligned so that the
correspondence among them can be visualized.

From the plot, we see that the audio-only recognizer almost
always gives an incorrect end-of-word boundary at this noise
level. In contrast, the bimodal system was able to precisely
determine the end boundaries in 6 out of 7 cases. It is interesting
to observe that the bimodal recognizer consistently introduced a
lead time before the audible starting point of a word. This
observation is consistent with the finding from human speech
perception that the visual speech usually leads the acoustic speech
by a varying time window. The duration of the visual lead-in
shown in Figure 5 ranges from about 40 ms to 150 ms.

5.  CONCLUSIONS 

We have described a novel sensory fusion architecture based on
CHMMs. A model transformation strategy that maps the
state space of a CHMM onto the state space of a classic HMM is
proposed to carry out inference and learning. Bimodal speech
recognition experiments demonstrate that the CHMM-based
fusion scheme can utilize the information in the visual channel
effectively in noisy conditions.

Airborne Acoustic Microphones

•  Handheld  microphones  (Shure,  etc) 

•  Headsets  (Knowles,  Shure,  Telex,  etc) 

-  Noise  canceling,  close  talking 

•  Super-directional  microphones  (Telex,  etc) 

-  Narrow  band  through  beam  forming 

- Linear arrays in a reinforcing pattern

Contact Acoustic Microphones

•  Throat  microphones  (TEA,  Genesys,  Temco) 

-  ARL  Physiological  Microphone 

•  32dB  noise  rejection 

•  Acoustic  response  differs  from  a  regular  microphone 

•  Ear  microphones  (Jabra,  Temco) 

-  Some  ear  microphones  are  bone  conduction 

• See next slide

Bone Conduction Microphones

•  Navy  bone  conduction  microphone 

•  Ear  mounted  bone  conduction  microphone 

-  Invisio  (TEA) 

•  Top  of  head  bone  conduction  (Temco) 

•  Tooth  mounted  bone  conduction 
microphone 

- Developed through a SBIR at CECOM

Other Alternative Speech Sensors

•  Glottal  Electromagnetic  Micropower  Sensor 
(GEMS) 

-  Developed  at  Lawrence  Livermore  Nat.  Labs 

-  Commercial  developer  Aliph 

-  Uses  RADAR  to  measure  internal  motion 

-  Reduced  bandwidth 

•  Lip  reading  system  (camera/computer) 

-  Provides  limited  information,  not  a  speech  signal 

- Robust to noise

Possible Sensor Fusion Methods

•  Combine  signals  from  multiple  sensors  in  a 
cooperative  fashion 

-  Some  non-standard  speech  sensors  capture 
speech  data  while  minimizing  noise,  but  do  not 
detect  the  full  bandwidth  of  the  speech  signal 

-  Could  extract  the  cleanest  spectral  components 
of each sensor for input to ASR software

Possible Sensor Fusion Methods

•  Use  “clean  speech”  from  noise  robust 
sensors  to  remove  noise  from  a  primary 
sensor  (airborne  microphone) 

-  Difference  in  secondary  sensor  signals  and 
primary  sensor  signal  is  the  noise  (in  the 
acoustic  bands  covered  by  the  secondary 
sensors) 

-  Could  use  correlation  to  remove  noise  that 
extends beyond the signal range of the sensor

Alternative Concept

•  Work  to  improve  a  non-standard  speech 
sensor  and  a  matched  ASR  system  to  provide 
an  integrated  speech-in-noise  package 

-  Need  a  sensor  with  good  noise  rejection  and 
“sufficient”  signal  capture  capability 

-  Need  to  tune  the  ASR  engine  to  the  peculiarities 
of the alternative speech sensor

Military Requirements

• Different for each application

- Just like in the commercial world

• Selection of domain can be used to limit the
problem

- Command and control (C2) domain

• Vocabulary of 1-5K words

• Typically command phrases

• Limited perplexity

Conclusion


• There are a wide variety of alternative
speech sensors available for exploitation for
SR in military applications

•  While  many  of  these  sensors  do  not  detect 
the  full  range  of  human  speech,  their 
intrinsic  noise  rejection  makes  them  useful 

•  Combinations  of  these  alternative  sensors 
may  provide  good  solutions  for  the 
application  of  speech  recognition  in  military 
environments

Abstract

We present our findings from audio-visual speech
recognition experiments for connected digit recognition in
noisy environments. We derive hybrid (geometric- and
appearance-based) visual lip features using a real-time lip
tracking algorithm that we proposed previously. Using a
small single-speaker corpus modeled after the TIDIGITS
database, we build whole-word HMMs using both
single-stream and 2-stream modeling strategies. For the
2-stream HMM method, we use stream-dependent weights to
adjust the relative contributions of the two feature streams
based on the acoustic SNR level. The 2-stream HMM
consistently gave the lowest WER, with an error reduction
of 83% at the -3 dB SNR level compared to the acoustic-only
baseline. A visual-only ASR WER of 6.85% was also
achieved. A real-time system prototype was developed for
concept demonstration.

1.  Introduction. 

By  combining  acoustic  and  visual  lip  features  for  speech 
recognition,  the  resulting  bimodal  speech  recognizer  is 
markedly  more  robust  in  the  presence  of  a  variety  of 
acoustic  noise,  when  compared  to  the  acoustic-only 
counterpart.  The  idea  was  pursued  in  a  number  of  past 
studies  [2][5][6][7][8][12][13][14][15][16][17][21].  Two 
key  elements  of  an  audio-visual  speech  recognition  system 
are:  (1)  a  front  end  for  visual  feature  extraction,  and  (2)  an 
information  fusion  architecture  for  integrating  features 
from  the  two  modalities.  In  recent  years,  considerable 
progress  has  been  made  in  the  first  area  [4][13][15][16],  as 
well  as  in  the  second  area  [6][8][14][15][17]. 

There  are  primarily  two  categories  of  visual  feature 
representation in the context of speech recognition. The
first is model-based or geometric-based. Examples of such
features  are  the  width  and  height  of  the  mouth  (and  their 
temporal  derivatives)  that  can  be  estimated  from  the 
images  using  a  tracking  procedure.  The  second  category  is 
pixel-based  or  appearance-based;  that  is,  the  features  are 
directly  derived  from  the  raw  pixel  values.  The  first 
category  is  more  intuitive,  but  there  is  typically  a 
substantial  loss  of  information  because  of  the  data 
reduction  involved.  There  is  little  loss  of  information  in 
the  second  representation,  but  the  high  dimensionality  of 
the  image  space  is  a  computational  disadvantage,  and 
pixel-based  features  do  not  directly  relate  to  observable 
articulator  motion.  Furthermore,  normalization  needed  to 
account  for  lighting  changes,  translation  and  other  effects 
is  more  difficult  compared  to  the  geometric-based 
counterpart. 

We  had  experimented  with  a  visual  feature  representation 
that  combined  the  two  types  of  features  in  our  previous 
work  and  demonstrated  its  effectiveness  in  simple  isolated 
digit  recognition  experiments  [4].  The  technique  is 
adopted  in  the  work  reported  in  this  paper.  Here  we 
develop  new  experiments  to  evaluate  our  system  using 
stream-weighted  2-stream  Hidden  Markov  Models 
(HMMs)  as  well  as  the  traditional  single  stream  HMMs  in 
the  context  of  connected  digit  recognition. 

The  rest  of  the  paper  is  organized  as  follows.  We  first 
briefly  describe  our  lip  localization  and  tracking 
algorithms  that  allow  geometric-based  features  to  be 
extracted  automatically,  and  pixel-based  features  to  be 
subsequently  normalized.  We  then  focus  on  the  proposed 
hybrid  feature  and  its  efficacy  in  the  context  of  visual-only 
speech  recognition.  Finally,  we  describe  the  recognition 
experiments  we  performed,  and  report  our  findings  from 
these  experiments  involving  audio-visual  speech 
recognition  of  connected  digits  in  the  presence  of  aircraft 
cockpit  noise  of  varying  SNR  levels. 




2.  Visual  Tracking  and  Localization. 

To automate machine lipreading, we need to locate and
track movements and appearance changes of the lips.
Several model-based approaches for tracking lip
movements that have been proposed include snake models
[10], deformable templates [20], active shape models [12],
and active contours [11]. We have developed an integrated
approach addressing both lip localization and lip tracking
[2][3]. The first part is based on Gaussian mixture
model-based clustering using hue in the HSV color space. The
largest elliptical connected region detected with the
expected range of hue values is identified as the lips. It is
usually quite effective and can be used to initialize the lip
tracking part. Tracking is based on a user-specific 2D
B-spline model that can be constructed offline, or estimated
from sample images [3]. To optimize tracking stability, the
model deforms only in an affine subspace, which is
adequate for capturing most lip movements that occur in
normal speech utterances. The model is driven (or fitted)
based on locations of steepest gradient in the image, in a
linearly transformed color space given by

$$s = \alpha \cdot r + \beta \cdot g + \gamma \cdot b,$$

where $\{\alpha, \beta, \gamma\}$ are speaker-dependent and are estimated
based on linear discriminant analysis on the RGB content
[3]. This overcomes problems associated with the often fuzzy
definition of the lip boundary in the luminance channel, and
the algorithm is consequently markedly more robust
compared to most snake-based algorithms and other
approaches based on grayscale information alone. Another
unique element is that the residual fitting error is used to
monitor tracking errors and outlier measurements, and can
trigger the lip localization module for automatic
re-initialization. We have implemented a real-time tracking
system on a 195MHz SGI O2 workstation that runs at
30fps. Figure 1 shows a few tracking examples.
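The role of the linear color projection can be illustrated with a toy scan across a skin-to-lip boundary. This is only a sketch: the pixel values and the projection weights below are hypothetical (the paper estimates the weights per speaker by linear discriminant analysis), and the function names are not from the original system.

```python
def project_color(rgb, weights):
    """Map an (r, g, b) pixel to the scalar s = alpha*r + beta*g + gamma*b."""
    return sum(w * c for w, c in zip(weights, rgb))


def steepest_gradient_index(scanline, weights):
    """Index i where |s[i+1] - s[i]| is largest along a line of pixels."""
    s = [project_color(p, weights) for p in scanline]
    steps = [abs(s[i + 1] - s[i]) for i in range(len(s) - 1)]
    return max(range(len(steps)), key=steps.__getitem__)


skin, lip = (200, 150, 130), (170, 80, 90)   # made-up RGB values
weights = (0.2, -0.7, 0.1)                   # hypothetical alpha, beta, gamma
scanline = [skin] * 3 + [lip] * 3            # boundary between pixels 2 and 3
print(steepest_gradient_index(scanline, weights))
```

A projection that emphasizes the channel in which lips and skin differ most gives a sharper step at the boundary than luminance alone, which is the motivation stated in the text.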

3.  Hybrid  Visual  Features. 

Hybrid features are comprised of both geometric- and
pixel-based features. Using tracking results obtained from
the algorithm described above, geometric-based features,
including the width and height of the mouth area and their
temporal derivatives, can be estimated automatically.
Pixel-based features are derived from the vertical intensity
profile calculated based on a subset of the pixels,
delimited by the boundary of the upper and lower lips
explicitly estimated by the tracking algorithm. The number
of pixels that defines the profile varies over time as the
lips open and close. By proper sub-sampling and linear
interpolation, we map the vertical profile to a feature
vector of constant length (e.g., 32 in our experiments).
Therefore, information about the height of the mouth is
largely decoupled from the pixel-based features. This is in
contrast to cropping a rectangular region in the image that
encompasses the lips in a sequence of image frames in an
utterance, and subsequently taking the central vertical
profile as the ROI. In practice, the ROI consists of a thin
strip of pixels, where smoothing in the orthogonal
direction is performed.

Figure 1: Snapshots of output from our lip tracking and
visual feature extraction system in a few video frames.
Geometric-based features were extracted from the
tracking contour. Normalized pixel-based features
were calculated based on the vertical intensity profile
in the middle mouth region (plotted horizontally in
light blue against a vertical axis).
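Mapping a variable-length vertical profile to a fixed-length vector can be sketched with plain linear interpolation, as below. This is an assumption-laden illustration: the paper combines sub-sampling with interpolation and does not give details, the function name is hypothetical, and no smoothing or normalization is applied here.

```python
def resample_profile(profile, length=32):
    """Linearly interpolate a 1-D intensity profile to a fixed length."""
    if len(profile) == 1:
        return [float(profile[0])] * length
    out = []
    scale = (len(profile) - 1) / (length - 1)
    for i in range(length):
        pos = i * scale                          # position in the source
        lo = min(int(pos), len(profile) - 2)     # left neighbor index
        frac = pos - lo
        out.append((1 - frac) * profile[lo] + frac * profile[lo + 1])
    return out


nearly_closed = resample_profile([10, 40, 10])       # 3-pixel profile
wide_open = resample_profile(list(range(100)))       # 100-pixel profile
print(len(nearly_closed), len(wide_open))
```

Both profiles end up with 32 entries regardless of how far the lips are open, which is why the mouth height is largely decoupled from the pixel-based features.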

Robustness  of  ROI  estimation  for  pixel-based  features  and 
the  accuracy  of  tracking  are  known  to  be  important  for 
improving  accuracy  of  visual  speech  recognition  [9][13]. 
The  approach  we  proposed  could  also  be  applied  to  the 
whole  ROI  defined  by  the  tracking  contour  as  opposed  to 
only  to  the  vertical  profile.  Furthermore,  transform-based 
features similar to those in [15] could also be derived and
used  as  features  instead.  Comparison  with  these  variants 
will  be  a  subject  of  future  study.  In  our  experiments,  the 
center  profile  contained  much  of  the  information  about  the 
appearance  of  the  teeth  and  tongue,  as  well  as  their  spatial 
relationship,  and  good  recognition  accuracy  was 
achievable  even  in  visual-only  speech  recognition. 

Figure  1  illustrates  the  application  of  the  tracking 
algorithm  for  the  extraction  of  visual  features  (both 
geometric-  and  pixel-based). 

4. HMM for Audio-Visual Speech.

Here we describe the basic elements of the HMMs in our
approach. 

An N-state HMM is characterized by a state transition
matrix, $\{a_{ij}\}, 1 \le i, j \le N$, and a set of continuous
observation density functions, one for each state, which
can be written as a Gaussian mixture

$$b_j(o_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o_t;\ \mu_{jm}, V_{jm}), \quad 1 \le j \le N,$$

where $o_t$ is the observation vector at time $t$, $c_{jm}$ is the
mixture coefficient, and $\mathcal{N}$ is a multivariate Gaussian
distribution with mean $\mu_{jm}$ and covariance $V_{jm}$ for the $m$th
mixture in state $j$.

The  acoustic  and  visual  features  were  combined  in  two 
different  ways  in  our  HMM-based  ASR  experiments.  In 
the  first  scheme,  acoustic  and  visual  feature  vectors  are 
concatenated  to  form  individual  feature  vectors.  In  the 
second  scheme,  we  model  acoustic  and  visual  features  in 
separate  feature  streams.  The  mixture  weights,  mean 
vectors  and  covariance  matrices  in  each  observation 
density  function  are  modeled  separately  in  individual 
streams. The corresponding observation density is given
by

$$b_j(o_t) = \left[ b_{j,a}(o_{a,t}) \right]^{\beta_a} \left[ b_{j,v}(o_{v,t}) \right]^{\beta_v},$$

where subscripts $a$ and $v$ are used to denote the audio and
visual channels, and the density of each channel is
weighted by exponents $\beta_a$ and $\beta_v$ respectively, where
$\beta_a + \beta_v = 1$. This is the multi-stream HMM formulation. The
implicit assumption is that the audio and video
observations are independent, which is not exactly
accurate. However, to be able to estimate reliably the
parameters of the Gaussian densities from a limited amount
of training data, it is customary to assume a diagonal
covariance, and hence the assumption can be applied
justifiably at least in the single Gaussian case with equal
stream weights. Empirically, the stream weights can be used
to give different emphasis to the observations, for example, based
on the relative reliability of each channel.
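In the log domain, the stream exponents become linear weights on the per-stream log-likelihoods. The sketch below evaluates a diagonal-covariance Gaussian mixture for each stream and combines the two with weights that sum to one. The function names and the toy one-dimensional models are hypothetical; this illustrates the formulation rather than the toolkit implementation used in the experiments.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, weights, means, variances):
    """Log density of a Gaussian mixture, using log-sum-exp for stability."""
    terms = [
        math.log(c) + log_gaussian_diag(x, m, v)
        for c, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def log_two_stream(obs_a, obs_v, gmm_a, gmm_v, beta_a):
    """Stream-weighted log-likelihood with beta_a + beta_v = 1."""
    beta_v = 1.0 - beta_a
    return beta_a * log_gmm(obs_a, *gmm_a) + beta_v * log_gmm(obs_v, *gmm_v)


# Toy one-dimensional, single-component models for both streams.
gmm_a = ([1.0], [[0.0]], [[1.0]])
gmm_v = ([1.0], [[0.0]], [[4.0]])
score = log_two_stream([0.5], [1.0], gmm_a, gmm_v, beta_a=0.35)
print(score)
```

Setting `beta_a` to 1.0 reduces the score to the audio-only log-likelihood, and lowering it shifts emphasis toward the visual stream, which is how the weights track channel reliability.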

5.  Speech  Recognition  Experiments. 

We performed a few evaluation experiments to compare
various visual feature choices and investigate the relative
merits of the various possible feature combinations. We
focused on the connected digit recognition task. The
eleven digits were 0-9 and ‘oh.’ The digit strings were
taken from TIDIGITS, where utterances of up to seven
digits were used. From a small database of 1518
audio-visual speech utterances, 759 were used for training and
759 for testing. Speech samples from one speaker were
used to isolate the effects of speaker variability in this
particular study. We used Hidden Markov Models to build
word-model based recognizers. Gaussian mixtures were
used to model the observation densities. The optimal
number of mixtures (1-10) and number of hidden states
(5-10) in the HMMs were determined empirically. A 3-state
silence model was also used. The acoustic features were
12 Mel frequency cepstral coefficients (MFCC) plus the
0th order cepstral coefficient, as well as their first and
second temporal derivatives, resulting in an acoustic
feature vector of size 39. They were computed every 10 ms
using a 25 ms frame analysis window. Per-utterance
cepstral mean normalization was also applied.

Table 1: Visual-only connected digit ASR word
error rate (WER %) for geometric (G), pixel-based
(P), and hybrid (G+P) features described in this
paper. The second and third rows are results with
delta and delta-delta features. The size of the base
feature vector is indicated in parentheses.

              G(2)    P(32)   G(2)+P(32)
Static        36.89   22.66   20.29
Static+Δ      26.88   11.59    9.88
Static+Δ+ΔΔ   27.80    9.49    6.85

The  geometric  features  were  derived  from  the  width  and 
height  of  the  mouth  normalized  with  respect  to  the 
corresponding  dimensions  when  the  speaker’s  mouth  was 
closed.  The  pixel-based  features  were  also  normalized 
with  respect  to  the  mean  value  of  the  vertical  profile  when 
the speaker’s mouth was closed. Interpolation of visual
features was performed to generate samples at the audio
feature frame rate of 100 Hz.

In  the  audio-visual  experiments,  the  audio  features  and 
visual  features  were  concatenated  to  form  a  single  feature 
vector  for  the  single  stream  HMM  case.  The  2-stream 
HMM  was  also  considered  where  the  stream  exponents 
were optimized using a linear step search. Alternatively,
they could be discriminatively trained [17]. The
Baum-Welch algorithm was used for EM-style embedded HMM
training,  and  the  Viterbi  decoding  algorithm  for 
recognition.  The  HTK  Toolkit  [19]  was  used  to  design 
these  experiments. 

Table 1 shows first a summary of the recognition
experiments employing visual features alone. One general
trend we observed was that dynamic features (delta and
delta-delta) in general carry additional information for
recognition. Visual-only ASR word error rate as good as
6.85% was achieved, which was remarkable since no
acoustic information was used and the pixel-based features
were derived only from a small subset of pixels.

Table 2: Recognition WER (%) for the audio-only
baseline (A), visual-only baseline (V), single stream
audio-visual (AV1), 2-stream audio-visual (AV2) ASR at
different SNR levels (dB). The reference visual feature
used here was G+ΔG+P. $\beta_a$ is the optimal stream weight
on the audio channel for AV2. Note that AV1 was worse
than the visual-only ASR at -3dB, whereas AV2
remained better.

SNR (dB)   clean   20      15      10      5       3       0       -3
A           0.13    0.66    5.53   23.58   67.19   75.63   80.11   85.11
V          17.26   17.26   17.26   17.26   17.26   17.26   17.26   17.26
AV1         0.13    0.53    1.32    2.50    7.38   10.14   15.55   22.79
AV2         0.13    0.26    0.53    2.50    6.59    9.75   12.12   14.49
$\beta_a$   0.95    0.85    0.8     0.65    0.5     0.45    0.35    0.35

In  the  second  experiment,  we  evaluated  the  effectiveness 
of  the  hybrid  feature  in  the  context  of  audio-visual  speech 
recognition  in  the  presence  of  noise.  To  be  consistent  with 
the  visual  features  used  in  our  previous  work  [4],  the 
hybrid  features  employed  were  the  combination  of  the 
base  static  pixel-based  features,  and  the  width  and  height 
of  the  mouth  together  with  their  first  temporal  derivatives 
(i.e., G+ΔG+P). We added F-16 cockpit noise (from the
NoiseX database) to the audio channel of the testing data
only, at SNR levels from 20 dB down to -3 dB.
Table 2 summarizes the results. We observe that the
bimodal  recognizers  consistently  outperformed  the  audio- 
only  counterpart  at  all  SNR  levels.  Furthermore,  the  2- 
stream  HMM  outperformed  the  single-stream  HMM,  and 
the  performance  difference  increased  as  the  SNR 
decreased.  That  was  possible  because  the  2-stream  HMM 
allowed  stream  weights  to  be  applied  selectively  based  on 
reliability  of  the  acoustic  features.  In  fact,  the  optimal 
stream  weight  on  the  audio  channel  decreased 
monotonically  with  the  SNR  level.  We  expect  the  overall 
performance  will  be  higher  if  we  use  all  delta  and  delta- 
delta  visual  features. 
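
Adding noise at a prescribed SNR amounts to scaling the noise so that the ratio of speech power to noise power matches the target. The sketch below shows one way to do this; the function names are hypothetical and the signals are plain lists of samples.

```python
import math
from typing import List


def power(x: List[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in x) / len(x)


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Scale the noise so that speech power / noise power equals the target SNR."""
    if len(noise) < len(speech):
        raise ValueError("noise recording must be at least as long as the speech")
    noise = noise[: len(speech)]
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech: List[float], noisy: List[float]) -> float:
    """SNR of a mixture, given the clean speech it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10.0 * math.log10(power(speech) / power(residual))
```

Calling `mix_at_snr(speech, noise, -3.0)` produces the most adverse condition considered here, with the noise about twice as powerful as the speech.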

Figure 2 shows a screenshot of the tracking and audio-visual
ASR system prototype that we have developed for
experimentation.

6.  Conclusion. 

We gave an overview of a real-time visual lip tracking system that
we used to define the ROI for visual feature calculation.
We demonstrated the efficacy of our hybrid visual features
in the context of connected digit recognition. Although the
single-stream audio-visual HMM using concatenated
features outperformed the acoustic-only counterpart, the
2-stream HMM gave the lowest WER at all SNR levels. The
optimal stream weight for the audio channel decreased as
the SNR level was lowered.

Abstract

Multimodal dialog systems research at the University
of Illinois seeks to develop algorithms and systems
capable of robustly extracting and adaptively combining
information about the speech and gestures of
a naive user in a noisy environment. This paper will
review our recent work in seven fields related to multimodal
semantic understanding of speech: audiovisual
speech recognition, multimodal user state recognition,
gesture recognition, face tracking, binaural
hearing, noise-robust and high-performance acoustic
feature design, and recognition of prosody.

1  Introduction 

The purpose of this paper is to summarize ongoing
multimodal speech and dialog recognition research
at the University of Illinois. A multimodal speech
recognition system can be described in two distinct
stages: (1) robust audiovisual feature extraction, and
(2) speech and user state recognition using dynamic
Bayesian networks. Features are extracted from audiovisual
input in order to optimally represent phonetic,
visemic, gestural, and prosodic information.
Our specific ongoing research projects include binaural
hearing (array processing on a mobile platform),
biomimetic noise-robust acoustic feature extraction,
maximum mutual information acoustic feature
design, and face tracking. Customized dynamic
Bayesian networks have been designed for three different
recognition tasks: audiovisual speech recognition
using coupled HMMs, user state recognition
using hierarchical HMMs, and recognition of speaking
rate using hidden-mode explicit-duration acoustic
HMMs.

Image and Speech Processing research at the University
of Illinois is currently tested in two ongoing
research prototype environments. The first research
prototype environment is an experimental computing
facility for teaching children about physics. The second
research environment is an autonomous robot,
Illy, who acquires language through the semantic association
of audio, visual, and haptic sensory data.
Prior to implementation on one or both of these platforms,
most of our algorithms are tested using standard
or locally acquired datasets.

2  Pre-Processing 

2.1  Binaural  Hearing 

Our research on binaural hearing addresses the extraction
of noise-robust audio from a two-microphone
array mounted on a physically mobile platform (a
language-learning autonomous robot). The source
localization algorithm is based on a two-channel
Griffiths-Jim beamformer [3] and a new phase unwrapping
algorithm for accurate estimation of time
difference of arrival (TDOA) measures [8]. The new phase
unwrapping algorithm is trained using many measurements
of TDOAs in order to create an accurate spatial
map of the TDOA pattern as a function of arrival
azimuth and elevation. These can then be used both
to cancel interfering noise and to get a faithful representation
of the desired speech signal. Preliminary
results show that a speech signal can be accurately
located in a noisy laboratory room within a few milliseconds
and with ten-degree accuracy at a distance
of 2-4 meters (acoustic far field).
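
To make the notion of a TDOA concrete, the sketch below estimates the delay between two microphone signals from the peak of their cross-correlation and converts it to an azimuth under a far-field, two-microphone geometry. This is a textbook substitute shown only for illustration; it is not the trained phase-unwrapping algorithm of [8], and the speed of sound, microphone spacing, and function names are assumptions.

```python
import math
from typing import List


def estimate_delay(left: List[float], right: List[float], max_lag: int) -> int:
    """Lag (in samples) that maximizes the cross-correlation of the two channels.

    A positive result means the right channel is delayed relative to the left.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, sample in enumerate(left):
            m = n + lag
            if 0 <= m < len(right):
                score += sample * right[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def azimuth_deg(tdoa_s: float, mic_spacing_m: float, speed_of_sound: float = 343.0) -> float:
    """Far-field arrival azimuth implied by a TDOA for a two-microphone array."""
    ratio = speed_of_sound * tdoa_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding outside [-1, 1]
    return math.degrees(math.asin(ratio))
```

A delay of `estimate_delay(left, right, max_lag) / sample_rate` seconds can then be passed to `azimuth_deg` together with the microphone spacing.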

In the current implementation, detection of a
speech signal triggers physical rotation of the receiver
platform (the robot’s “head”) so that it faces the primary
talker. By physically aligning the “head” of the
robot with the direction of primary source arrival, we
are able to use extremely efficient off-axis cancellation
algorithms for improved SNR [9].

2.2  Acoustic  Features 

Figure 1: WER: isolated digit recognition in white
noise with two standard feature sets, MFCC and
LPCC, and two novel feature sets, LPCC with voice
index and with frame index (from [6]).

Standard speech recognition features (including
MFCC, PLP, and LPCC) result in isolated digit
recognition error rates of approximately 60% at 10 dB
SNR, and nearly 80% at 0 dB SNR. In 1992, Meddis
and Hewitt proposed a biomimetic method for
recognition of voiced speech in high-noise environments
[10]. Meddis and Hewitt proposed splitting
a noisy speech signal into many bands, computing
the autocorrelation function $R_k(\tau)$ in each sub-band,
and then estimating the speech autocorrelation $R(\tau)$
by optimally selecting and adding together the
high-SNR sub-band autocorrelations. In our work [6], we
have replaced Meddis and Hewitt’s optimal selection
algorithm by an optimal scaling algorithm. Specifically,
we estimate the sub-band SNR $\nu_k$ using a standard
pitch prediction coefficient, i.e.

$$\nu_k = \frac{\text{Speech Energy in Band } k}{\text{Total Energy in Band } k} \approx \frac{R_k(T_0)}{R_k(0)} \qquad (1)$$

where  T0  is  the  globally  optimum  pitch  period.  The 
maximum  likelihood  estimate  of  the  noise-free  speech 
signal  autocorrelation  is  then 

$$\hat{R}(\tau) = \sum_k \nu_k R_k(\tau) \qquad (2)$$

In isolated digit recognition experiments, the use of
equations 1 and 2 reduced word error rate by more
than a factor of three in white noise at 10 dB through
-10 dB, and by more than a factor of two in babble
noise at the same SNRs (Figure 1).
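
The scaling idea can be sketched as follows: each sub-band receives a weight equal to its normalized autocorrelation at the pitch period, so that bands dominated by periodic speech contribute more to the combined estimate. The sketch assumes that the band-pass filtering and pitch estimation have already been done and that the scaled sub-band autocorrelations are simply summed; the function names are hypothetical.

```python
import math
from typing import List, Tuple


def autocorr(x: List[float], lag: int) -> float:
    """Unnormalized autocorrelation of x at a non-negative lag."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))


def subband_weight(band: List[float], pitch_period: int) -> float:
    """Pitch prediction coefficient R_k(T0) / R_k(0) of one sub-band."""
    r0 = autocorr(band, 0)
    return autocorr(band, pitch_period) / r0 if r0 > 0.0 else 0.0


def combined_autocorr(
    bands: List[List[float]], pitch_period: int, max_lag: int
) -> Tuple[List[float], List[float]]:
    """Return the per-band weights and the weighted sum of sub-band autocorrelations."""
    weights = [subband_weight(b, pitch_period) for b in bands]
    combined = [
        sum(w * autocorr(b, lag) for w, b in zip(weights, bands))
        for lag in range(max_lag + 1)
    ]
    return weights, combined


def tone(period: float, n: int) -> List[float]:
    """Test signal: a sinusoid with the given period in samples."""
    return [math.sin(2.0 * math.pi * i / period) for i in range(n)]
```

For a band that is periodic at the pitch period the weight approaches one, while a band whose content is unrelated to the pitch receives a weight near zero.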

The phonological features implemented at a speech
landmark influence the acoustic spectrum at distances
of 50-100 ms [4, 19]. Complete representation
of a 100 ms spectrogram requires a 120-dimensional
acoustic feature vector. It is not possible to accurately
train observation PDFs of dimension 120 using
existing data sets, but it is possible to select a sub-vector
using a quantitative optimality criterion. In
our research, we select a 39-dimensional feature sub-vector
from a list of 160 candidate features in order
to optimize the mutual information between features
and phoneme labels [12]. Optimality is determined
using a clean speech database (TIMIT) with no language
model, but the resulting optimality generalizes.
As shown in Table 1, the resulting MMIA (maximum
mutual information acoustic) feature vector outperforms
all standard feature vectors under at least three
conditions: in quiet and at 10 dB SNR, without a language
model and with an optimized phoneme bigram.
Larger improvements may be obtained by testing the
5-10 best feature vectors generated during the mutual
information search. The best recognition accuracy,
obtained using the feature set with second-best mutual
information, was 62% with no language model
in quiet conditions.

Table 1: Phoneme recognition correctness in four conditions.
Features selected using a maximum mutual
information criterion (MMIA) provide superior performance
in all four conditions.
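
The selection criterion can be illustrated with a simplified sketch that scores each quantized candidate feature by its mutual information with the phoneme labels and keeps the highest-scoring ones. Ranking features one at a time is an assumption made for brevity; the search described above optimizes the mutual information of the sub-vector as a whole, and the quantization step is taken as given.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def mutual_information(feature_bins: Sequence[int], labels: Sequence[str]) -> float:
    """Mutual information (in bits) between a quantized feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature_bins, labels))
    p_feat = Counter(feature_bins)
    p_lab = Counter(labels)
    mi = 0.0
    for (f, lab), count in joint.items():
        mi += (count / n) * math.log2(count * n / (p_feat[f] * p_lab[lab]))
    return mi


def rank_features(
    columns: Dict[str, Sequence[int]], labels: Sequence[str], keep: int
) -> List[str]:
    """Names of the `keep` candidate features with the highest mutual information."""
    ranked = sorted(
        columns,
        key=lambda name: mutual_information(columns[name], labels),
        reverse=True,
    )
    return ranked[:keep]
```

A feature that separates the phoneme classes perfectly attains the label entropy, while a feature that is independent of the labels scores zero.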

2.3  Face  Tracking 

Research has shown that facial and vocal-tract motions
are highly correlated during speech production
[20]. Speech recognition using both audio and visual
features has been shown to be more robust in noisy
environments [5]. Analysis of non-rigid human facial motion
is a key component for acquiring visual features for
audio/visual speech recognition.

In the past several years, research in our group has
led to a robust 3D facial motion tracking system [16].
A 3D non-rigid facial motion model is manually constructed
based on a piecewise Bezier volume deformation
model (PBVD). It is used to constrain the noisy
low-level optical flow. The tracking is done in a
multi-resolution manner so that higher speed can be
achieved. It runs at 5 fps on an SGI Onyx2 machine.
This tracking algorithm has been successfully used for
audio-visual speech recognition and bimodal emotion
recognition.

Figure 2: Demonstration of our face tracking system.

2.4  Gesture  Recognition 

Hand gestures are capable of delivering information
not presented in speech [14]. Controlling gestures can
be used to provide commands to the system. Navigation
gestures provide information for manipulating
virtual objects, and for selecting point objects or
large regions on the screen. Conversational gestures
provide subtle cues to sentence meaning in normal
human interaction. Automated hand tracking and
gesture recognition can help improve the performance
of human-machine interfaces.

We have investigated both appearance-based gesture
recognition (using neural network-based pattern
recognition techniques) and model-based gesture
recognition [18, 17]. In model-based recognition, the
configuration of a hand model is first determined by
providing a set of joint angle parameters. The 2D
projection of this hand model, determined by the
translation and orientation of the model relative to
a viewing portal, is compared with the hand image
from input video. The estimate of the correct input hand
configuration is given by the best-matching projection.
A complete description of the global hand
position and all finger joint angles requires specification
of 21 joint angles. Using both known anatomical
constraints and PCA to reduce dimensionality,
we can initially reduce the dimensionality of the gestural
description from 21 to 7 independent dimensions
while keeping 95% of the information. In this
7-dimensional space, it is possible to define 28 basis
configurations, consisting of the configurations in
which each finger is either fully folded or completely
extended. A close examination of the motion trajectories
between these basis states shows that natural
hand articulations seem to lie largely in the linear
manifold spanned by pairs of basis states. We believe
that, based on these preliminary results, it will
be possible to map all observed gestures into a
low-dimensional gestural manifold, resulting in efficient
and accurate gesture recognition.

3  Dynamic  Bayesian  Networks 

3.1  Lip  Reading 

The focus of our research in lip reading is a novel approach
to the fusion problem in audio-visual speech
processing and recognition. Our fusion algorithm is
built upon the framework of coupled hidden Markov
models (CHMMs). CHMMs are probabilistic inference
graphs that have hidden Markov models
(HMMs) as sub-graphs. Chains in the corresponding
inference graph are coupled through matrices of
conditional probabilities modeling temporal dependencies
between their hidden state variables. The
coupling probabilities are both cross-chain and
cross-time. The latter is essential for capturing temporal
influences between chains. In a bimodal speech recognition
system, two-chain CHMMs are deployed, with
one chain being associated with the acoustic observations,
the other with the visual features. Under
this framework, the fusion of the two modalities takes
place during the classification stage. The particular
topology of the CHMM ensures that the learning and
classification are based on the audio and visual domains
jointly, while allowing asynchronies between
the two information channels.

In essence, CHMMs are directed graphical models
of stochastic processes and are a special type of Dynamic
Bayesian Networks (DBNs). DBNs generalize
HMMs by representing the hidden states
as state variables, and allow the states to have complex
interdependencies. The DBN point of view facilitates
the development of inference algorithms for
the CHMMs. Specifically, two inference algorithms
are proposed in this work. Both of the algorithms are
exact methods. The first is an extension of the well-known
forward-backward algorithm from the HMM
literature. The second is a strategy of converting
CHMMs to mathematically equivalent HMMs, and
carrying out learning in the transformed models.
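
The second strategy can be illustrated for a two-chain CHMM. If each chain's next state depends on the previous states of both chains, the equivalent HMM has one state per (audio state, visual state) pair, and its transition probability is the product of the two coupled conditionals. The sketch below assumes this factorization and a particular array layout; it covers only the transition structure, not the learning procedure.

```python
from typing import List

Table3 = List[List[List[float]]]


def chmm_to_hmm(trans_a: Table3, trans_v: Table3) -> List[List[float]]:
    """Build the transition matrix of the HMM equivalent to a two-chain CHMM.

    trans_a[a][v][a2] = P(audio state a2 at t | audio a, visual v at t-1)
    trans_v[a][v][v2] = P(visual state v2 at t | audio a, visual v at t-1)
    The joint state (a, v) is stored at index a * n_visual + v.
    """
    n_a = len(trans_a)
    n_v = len(trans_a[0])
    size = n_a * n_v
    joint = [[0.0] * size for _ in range(size)]
    for a in range(n_a):
        for v in range(n_v):
            for a2 in range(n_a):
                for v2 in range(n_v):
                    joint[a * n_v + v][a2 * n_v + v2] = (
                        trans_a[a][v][a2] * trans_v[a][v][v2]
                    )
    return joint
```

The resulting matrix has (N_a N_v) squared entries, which is why the conversion is practical mainly for chains with few states.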

The benefits of the proposed fusion scheme are
confirmed by a series of preliminary experiments
on audio-visual speech recognition. Visual features
based on lip geometry are used in the experiments.
Furthermore, compared with an acoustic-only
ASR system trained using only the audio channel
of the same dataset, the bimodal system consistently
demonstrates improved noise robustness across
a wide range of SNR levels.

SNR     10 dB   20 dB   30 dB
A        4.03   43.61   99.10
V       42.95   42.95   42.95
A+V     10.58   72.79   99.74
CHMM    35.32   86.58   93.32

Table 2: Results of experiments in audiovisual speech
recognition (measured in % word accuracy). A indicates
the audio-only system, V indicates the visual-only
system, A+V indicates a bimodal system using
early integration, and CHMM indicates the CHMM-based
system.

3.2  Prosody 

Our approach to the recognition of prosody is the
use of a “hidden mode variable” [13] to condition the
explicit duration PDFs of a CVDHMM [7]. In our
prototype algorithm, the state space consists of parallel
phonetic state variables ($q_t$) and prosodic state
variables ($k_t$). The dwell time of state $q_t$ is a random
variable $d_q$ with PDF $p(d_q \mid q, k)$. At the
end of the specified dwell time, the phonetic variable
always changes state (no self-loops), but the prosodic
state variable may or may not change state. Thus,
for example, if $k_t \in \{\text{slow}, \text{medium}, \text{fast}\}$ represents
speaking rate, it may be reasonable to allow $k_t$ to
change state at any word boundary with a small probability.

In order to allow efficient experiments, we have
modified HTK to make use of Ferguson’s EM algorithm
for explicit-duration HMMs [1, 2]. Ferguson’s
algorithm is an order of magnitude faster than
most algorithms for explicit-duration HMMs.
The computational complexity of the algorithm is
$O(NT(N + T))$, where $N$ is the number of states,
$T$ is the number of frames in the input signal, and
$O(N^2 T)$ is the complexity of an HMM without explicit
duration. The forward algorithm computes

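$$\alpha^*_t(j) = P(O_1, \ldots, O_t,\ j \text{ commences at } t+1) = \sum_i \alpha_t(i)\, a_{ij}$$

$$\alpha_t(i) = P(O_1, \ldots, O_t,\ i \text{ ends at } t) = \sum_d \alpha^*_{t-d}(i)\, p_i(d)\, P(O_{t-d+1}, \ldots, O_t \mid i)$$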
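
A minimal sketch of the forward pass for an explicit-duration HMM follows. It tracks two quantities per frame, the probability that a state ends at frame t and the probability that a state commences at frame t+1. Discrete emissions, a bounded maximum duration, an initial distribution over the first state, the requirement that the last segment ends exactly at the final frame, and the omission of the prosodic mode variable are all simplifying assumptions of this example; it is not the modified HTK implementation.

```python
from typing import List


def explicit_duration_forward(
    obs: List[int],
    pi: List[float],
    trans: List[List[float]],
    dur: List[List[float]],
    emit: List[List[float]],
) -> float:
    """Likelihood of obs under an explicit-duration HMM.

    pi[j]       probability that state j commences at frame 1
    trans[i][j] state transition probability (no self-loops, trans[i][i] = 0)
    dur[i][d-1] probability that state i lasts exactly d frames
    emit[i][o]  probability that state i emits symbol o in one frame
    """
    n_frames, n_states = len(obs), len(pi)
    ends = [[0.0] * n_states for _ in range(n_frames + 1)]    # state i ends at t
    starts = [[0.0] * n_states for _ in range(n_frames + 1)]  # state j commences at t+1
    starts[0] = list(pi)
    for t in range(1, n_frames + 1):
        for i in range(n_states):
            total, segment = 0.0, 1.0
            for d in range(1, min(len(dur[i]), t) + 1):
                segment *= emit[i][obs[t - d]]  # extends the segment back to frame t-d+1
                total += starts[t - d][i] * dur[i][d - 1] * segment
            ends[t][i] = total
        for j in range(n_states):
            starts[t][j] = sum(ends[t][i] * trans[i][j] for i in range(n_states))
    return sum(ends[n_frames])
```

When every state has duration one with probability one, the recursion reduces to the forward algorithm of an ordinary HMM without self-loops.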

3.3  User  State  Recognition 

Integration of a large number of sources for the purpose
of multimodal user-state recognition can be accomplished
using a hierarchical dynamic Bayesian
network (Figure 3). In a hierarchical DBN, each
modality (audio, lip reading, gesture, and prosody)
is modeled using a modality-dependent HMM. Each
modality-dependent HMM is searched in order to
generate the N transcriptions that best match the
observed data in the given modality. The likelihood
of each transcription is then estimated using a constrained
forward-backward algorithm, generating the
probability of state residency during every frame.
These probabilities are fed forward to the supervisor
HMM, which integrates them to determine a single
transcription of the sentence in order to maximize the
a posteriori transcription probability. By imposing a
prior on the probability distributions learned by the
model for the purpose of increasing conditional entropy,
we have demonstrated a 10% increase in user
state classification performance [15, 11].

Figure 3: Architecture for detecting events in the office
scenario


4  Conclusions 


Our research is intended to elucidate both the theoretical
and the practical requirements for effective
multimodal speech understanding systems. The use
of speech in multimodal systems will increase our theoretical
understanding of the problems of sensor fusion
and representations of multimodal signals. Increased
theoretical understanding, in turn, will enable
us to produce practical results that can be directly
used in state-of-the-art speech recognition systems
and as part of larger systems for advanced
human-machine communication.

ABSTRACT

Demand has recently grown for Automatic Speech
Recognition (ASR) systems able to operate robustly in acoustically
noisy environments. This paper proposes a method to effectively
integrate audio and visual information in audio-visual
(bi-modal) ASR systems. Such integration inevitably necessitates
modeling of the synchronization and asynchronization of the audio
and visual information. To address the time lag and correlation
problems in individual features between speech and lip movements,
we introduce a type of integrated HMM modeling of audio-visual
information based on a family of product HMMs. The proposed
model can represent state synchronicity not only within a
phoneme but also between phonemes. Furthermore, we also propose
a rapid stream weight optimization based on the GPD algorithm
for noisy bi-modal speech recognition. Evaluation experiments
show that the proposed method improves the recognition accuracy
for noisy speech. At SNR = 0 dB, our proposed method attained
16% higher performance compared to product HMMs without
the synchronicity re-estimation.

1.  INTRODUCTION 

The performance of ASR systems has been drastically improved
recently. However, it is well known that the performance can be
seriously degraded in acoustically noisy environments. Audio-visual
ASR [1, 2, 4] systems offer the possibility of improving the
conventional speech recognition performance by incorporating visual
information, since acoustic noise degrades the audio signal
but does not affect the visual information.

Audio and visual phonetic features have different durations.
In other words, there is loose synchronicity between them. For
instance, a speaker opens the mouth before making an utterance,
and closes it after making the utterance. Furthermore, the time
lag between the movement of the mouth and the voice might be
dependent on the speaker or context.

As audio-visual integration methods for ASR systems, early
integration and late integration are well known [1, 2]. In the early
integration scheme, a conventional HMM is trained using audio-visual
data. This method, however, cannot sufficiently represent
the loose synchronization between the audio and visual information.
Furthermore, the visual features of the conventional HMM
may end up relatively poorly trained because of misalignments
during the model estimation caused by the segmentation of the audio
features. In the late integration scheme, the audio data and visual
data are processed separately to build two independent HMMs
[1, 4]. This scheme assumes complete asynchronization between
the audio and visual features. In addition, it can make the best use
of the audio and visual data because there is a smaller bi-modal
database than the typical database for audio only. However, the
audio and visual features are regarded as independent. In this paper,
in order to model the synchronization between audio and visual
features, we propose pseudo-biphone product HMMs, which
realize state-synchronous audio-visual integration. The proposed
model can represent synchronicity not only within a phoneme but
also beyond phoneme boundaries. Furthermore, we propose a new
method based on the GPD algorithm to optimize the stream weights of the
proposed pseudo-biphone product HMMs.

2.  AUDIO-VISUAL  INTEGRATION  BASED  ON 
PRODUCT  HMM 

Fig. 2. Product HMM

Figure 1 shows the outline of the acoustic model training for ASR
systems in this paper. Figure 2 shows the proposed HMM topology.
First, in order to create the audio and visual phoneme HMMs
independently, audio features and visual features are extracted from
audio data and visual data, respectively. In general, the frame rate
of audio features is higher than that of visual features. Accordingly,
the extracted visual features are incorporated such that the
audio and visual features have the same frame rate. Second, the audio
and visual features are modeled individually into two HMMs
by the EM algorithm. Finally, an audio-visual phoneme HMM
is composed as the product of these two HMMs based on HMM
composition. The output probability at state $ij$ of the audio-visual
HMM is

$$b_{ij}(O_t) = b_i^A(O_t^A)^{\alpha_A} \times b_j^V(O_t^V)^{\alpha_V} \qquad (1)$$

which is defined as the product of the output probabilities of the audio
and visual streams. Here, $b_i^A(O_t^A)$ is the output probability
of the audio feature vector at time instance $t$ in state $i$, $b_j^V(O_t^V)$
is the output probability of the visual feature vector at time instance
$t$ in state $j$, and $\alpha_A$ and $\alpha_V$ are the audio stream weight and
visual stream weight, respectively. In a similar manner, the transition
probability from state $ij$ to state $kl$ in the audio-visual HMM
is defined as follows,

$$p_{ij,kl} = p^A_{i,k} \times p^V_{j,l} \qquad (2)$$

where $p^A_{i,k}$ is the transition probability from state $i$ to state $k$ in
the audio HMM, and $p^V_{j,l}$ is the transition probability from state $j$
to state $l$ in the visual HMM. This composition is performed for all
phonemes. In the method proposed by [4], a similar composition
is used for the audio and visual HMMs. However, because the
audio and visual HMMs are trained individually, the dependencies
between the audio and visual features are ignored. This results in
the following two problems.

1. The product HMMs cannot, as they stand, represent the loose
synchronicity within phonemes.

2. The product HMMs force a strict synchronization on every
phoneme boundary.
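
The product composition of Eqs. (1) and (2) can be sketched as follows. Product states are indexed as pairs of an audio state and a visual state, transition probabilities are multiplied, and the stream-weighted output probability is evaluated in the log domain. The index layout and function names are assumptions of this example.

```python
import math
from typing import List


def compose_transitions(
    p_audio: List[List[float]], p_visual: List[List[float]]
) -> List[List[float]]:
    """Transition matrix of the product HMM.

    The product state (i, j) is stored at index i * n_visual + j, and the
    probability of moving from (i, j) to (k, l) is p_audio[i][k] * p_visual[j][l].
    """
    n_a, n_v = len(p_audio), len(p_visual)
    prod = [[0.0] * (n_a * n_v) for _ in range(n_a * n_v)]
    for i in range(n_a):
        for j in range(n_v):
            for k in range(n_a):
                for l in range(n_v):
                    prod[i * n_v + j][k * n_v + l] = p_audio[i][k] * p_visual[j][l]
    return prod


def log_output_prob(
    log_b_audio: float, log_b_visual: float, w_audio: float, w_visual: float
) -> float:
    """Log of the stream-weighted product of the audio and visual output probabilities."""
    return w_audio * log_b_audio + w_visual * log_b_visual
```

Because each factor is estimated from one modality alone, the composed parameters carry no information about how the audio and visual states co-occur, which is the source of the two problems listed above.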

This paper proposes a new approach to solve the two problems.
The approach consists of re-estimation of the product HMM
parameters by using a small amount of audio-visual synchronous
adaptation data, and pseudo-biphone product HMMs, which represent
loose state synchronicity beyond the phoneme boundary.

2.1.  State  Synchronous  Modeling  within  a  Phoneme 

The first problem is the inability of the conventional product
HMMs to represent loose state synchronicity within a phoneme.
This problem is caused by the fact that the transition probabilities
and output probabilities are obtained by the multiplication of
probabilities from independent states of audio and visual HMMs. We
propose new product HMMs whose parameters are re-estimated
using audio-visual synchronous adaptation data [3]. The re-estimation
is able to introduce the loose state synchronicity of the states of the two
modalities into the product HMM. The re-estimation procedure is
carried out using a small amount of audio-visual synchronous data.
After the composition of two HMMs, the product HMMs can be
re-estimated based on the Baum-Welch algorithm for multi-stream
HMMs.


Figure 3 shows results comparing audio HMMs, visual HMMs,
early integration, late integration, and product HMMs with and
without re-estimation [3]. The experimental conditions are the
same as those in a later section except that the audio HMMs are
trained using clean speech data. The figure shows that the product
HMMs with re-estimation achieve the best performance, while the
product HMMs without re-estimation perform worse than the
early and late integration schemes.

2.2. State Synchronous Modeling Beyond the Phoneme Boundary

The second problem is that the conventional product HMMs force
a strict synchronization on every phoneme boundary. This is a
problem because the speech organs normally move earlier than the
speech to be produced. Sometimes, the speech organs are already articulated
in the previous audio phoneme utterance. Accordingly, we have to
consider state synchronous modeling beyond the phoneme boundary.
We have carried out preliminary experiments using audio-visual
word HMMs and confirmed, by looking at the optimal paths, that
synchronicity is not always kept on a phoneme boundary [5].

We propose new product HMMs that include extra asynchronous
states on phoneme boundaries as indicated in Fig. 4. The core
states of the phoneme HMMs are the same as those of context-independent
phoneme product HMMs. In addition, the new product
HMMs have two extra HMM states aiming to work similarly to
the word HMMs. The first extra state is composed of the initial
audio state and final visual state of the preceding phoneme HMM.
The second extra state is composed of the initial visual state and
final audio state of the preceding phoneme HMM. Since these extra
states are dependent on the preceding phoneme, they can only
be re-estimated in a manner similar to the biphone HMMs. Therefore,
we call these HMMs pseudo-biphone product HMMs. The
proposed HMMs can tolerate one-state asynchronicity beyond a
phoneme boundary.

3.  STREAM  WEIGHT  OPTIMIZATION 

As methods for estimating stream weights, maximum likelihood
[6] based methods and GPD (Generalized Probabilistic Descent) [7]
based methods have been proposed. However, the former methods
have a serious estimation drawback: the scales of the two
probabilities are normally very different, so the weights cannot
be estimated optimally. The latter methods have substantial
potential for optimizing the weights. However, a serious problem
is that these methods require a large amount of adaptation data
for the weight estimation. In this paper, we propose a GPD-based
simplified adaptive estimation of stream weights using GMMs for
new noisy acoustic conditions.

Fig. 4. Pseudo-biphone product HMMs

The GPD training approach defines a misclassification
measure, which provides distance information concerning the correct
class and all other competing classes. The misclassification
measure is formulated as a smoothed loss function. This loss function
is minimized by the GPD algorithm. Here, let $L_c^{(x)}(\Lambda)$ be the
log-likelihood score in recognizing input data $x$ for adaptation using
the correct word model, where $\Lambda = \{\lambda_A, \lambda_V\}$.

In a similar way, let $L_n^{(x)}(\Lambda)$ be the score in recognizing data
$x$ using the $n$-th best candidate among the mistaken word models.

The misclassification measure is defined as,

$$d^{(x)}(\Lambda) = -L_c^{(x)}(\Lambda) + \log\left[\frac{1}{N}\sum_{n=1}^{N}\exp\{\eta L_n^{(x)}(\Lambda)\}\right]^{1/\eta} \qquad (3)$$

where $\eta$ is a positive number, and $N$ is the total number of candidates.
The smoothed loss function for each data is defined as,

$$l^{(x)}(\Lambda) = [1 + \exp\{-\alpha d^{(x)}(\Lambda)\}]^{-1} \qquad (4)$$

where $\alpha$ is a positive number. In order to stabilize the gradient, the
loss function for the entire data is defined as,

$$l(\Lambda) = \sum_{x=1}^{X} l^{(x)}(\Lambda) \qquad (5)$$

where $X$ is the total amount of data. The minimization of the
loss function expressed by equation (5) is directly linked to the
minimization of the error. The GPD algorithm adjusts the stream
weights recursively according to,

$$\Lambda_{k+1} = \Lambda_k - \varepsilon_k E_k \nabla l(\Lambda), \quad k = 1, 2, \ldots \qquad (6)$$

where $\varepsilon_k > 0$, $\sum_{k=1}^{\infty} \varepsilon_k = \infty$, $\sum_{k=1}^{\infty} \varepsilon_k^2 < \infty$, and $E_k$ is a unit
matrix.

In this paper, we propose to use GMMs instead of HMMs to
find the optimal stream weights, but not for the recognition itself.
GPD training on GMMs is quite simple and requires a smaller amount
of training data. We use 18-mixture Gaussians for the GMMs and train them
using all of the training data.

4. EVALUATION EXPERIMENTS

The  audio  signal  is  sampled  at  12  kHz  (down-sampled)  and  ana¬ 
lyzed  with  a  frame  length  of  32  msec  every  8  msec.  The  audio  fea¬ 
tures  are  16-dimensional  MFCC  and  16-dimensional  delta  MFCC. 

On  the  other  hand,  the  visual  image  signal  is  sampled  at  30  Hz  with 
256  gray  scale  levels  from  RGB.  Then,  the  image  level  and  loca¬ 
tion  are  normalized  by  a  histogram  and  template  matching.  Next, 
the  normalized  images  are  analyzed  by  two-dimensional  FFT  to 
extract  6x6  log  power  2-D  spectra  for  audio-visual  ASR.  Finally, 
35-dimensional  2D  log  power  spectra  and  their  delta  features  are 
extracted.  For  each  modality,  the  basic  coefficients  and  the  delta 
coefficients  are  collectively  merged  into  one  stream.  Since  the 
frame  rate  of  the  video  images  is  1/30,  we  insert  the  same  im¬ 
ages  so  as  to  synchronize  the  face  image  frame  rate  to  the  audio 
speech  frame  rate.  For  the  HMMs,  we  use  a  two-mixture  Gaussian 
distribution  and  assign  three  states  for  the  audio  stream  and  two 
states  for  the  visual  stream  in  the  late  integration  HMMs  and  the 
baseline  product  HMMs.  In  this  research,  we  perform  word  recog¬ 
nition  evaluations  using  a  bi-modal  database  [1],  We  use  4740 
words  for  HMM  training  and  two  sets  of  200  words  for  testing. 
These  200  words  are  different  from  the  words  used  in  the  training. 
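
The frame-rate synchronization described above (repeating video images so that each 8 msec audio frame has a matching image) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the rule that an audio frame takes the video image whose interval contains the audio frame's start time is an assumption, since the text does not specify the exact alignment.

```python
def video_index_for_audio_frame(i, shift_ms=8, video_fps=30):
    """Index of the video image covering the start time of audio frame i.

    Assumption: audio frame i starts at i * shift_ms milliseconds, and
    video image j covers the interval [j / video_fps, (j + 1) / video_fps).
    Integer arithmetic avoids floating-point rounding at the boundaries.
    """
    return (i * shift_ms * video_fps) // 1000


def upsample_video(video_frames, n_audio_frames, shift_ms=8, video_fps=30):
    """Repeat video images so that there is one image per audio frame."""
    last = len(video_frames) - 1
    return [
        video_frames[min(video_index_for_audio_frame(i, shift_ms, video_fps), last)]
        for i in range(n_audio_frames)
    ]
```

With an 8 msec shift there are 125 audio frames per second against 30 video images, so each image is repeated four or five times.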

We perform experiments using 15, 25, and 50 adaptation words. The
context of the data for the adaptation differs from that of the test data.

In order to examine in more detail the estimation accuracy in the
case of less adaptation data, we carry out recognition experiments
using three sets of data, each with contexts as different as possible.
The size of the vocabulary in the dictionary is 500 words
during the recognition of the adaptation data. The GPD algorithm
convergence pattern is known to depend greatly on the choice of
parameters. Accordingly, we set $\eta = 1$ in (3), $\alpha = 0.1$ in (4),
$\epsilon_k = 100/k$ in (6), and the maximum iteration count to 8.
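
A minimal sketch of the GPD update of Eqs. (3)-(6) for two stream weights is given below. It is not the authors' implementation: it assumes that each candidate's score is the weighted sum of its per-stream log-likelihoods, $L = \lambda_A L_A + \lambda_V L_V$, that $E_k$ is the identity, and that no constraint (such as $\lambda_A + \lambda_V = 1$) is enforced. All function and argument names are hypothetical.

```python
import math


def gpd_loss_and_grad(lam, correct, rivals, eta=1.0, alpha=0.1):
    """Total smoothed loss (Eq. 5) and its gradient w.r.t. the stream weights.

    lam     : [lambda_audio, lambda_visual]
    correct : per-item [audio, visual] log-likelihoods of the correct model
    rivals  : per-item list of [audio, visual] log-likelihoods, one entry
              per competing candidate
    Assumption: a candidate's score is lam[0]*audio + lam[1]*visual.
    """
    total, grad = 0.0, [0.0, 0.0]
    for c, cands in zip(correct, rivals):
        score_c = lam[0] * c[0] + lam[1] * c[1]
        scores = [lam[0] * r[0] + lam[1] * r[1] for r in cands]
        top = max(eta * s for s in scores)            # shift for numerical stability
        expo = [math.exp(eta * s - top) for s in scores]
        z = sum(expo)
        d = -score_c + (top + math.log(z / len(scores))) / eta   # Eq. (3)
        loss = 1.0 / (1.0 + math.exp(-alpha * d))                # Eq. (4)
        total += loss                                            # Eq. (5)
        slope = alpha * loss * (1.0 - loss)                      # dl/dd
        for s in range(2):
            # dd/dlambda_s: minus the correct stream score plus the
            # softmax-weighted average of the competitors' stream scores
            dd = -c[s] + sum(e * r[s] for e, r in zip(expo, cands)) / z
            grad[s] += slope * dd
    return total, grad


def gpd_adapt(lam, correct, rivals, eps0=100.0, iterations=8, eta=1.0, alpha=0.1):
    """Run the recursion of Eq. (6) with step size eps0 / k and E_k = I."""
    lam = list(lam)
    for k in range(1, iterations + 1):
        _, grad = gpd_loss_and_grad(lam, correct, rivals, eta, alpha)
        step = eps0 / k
        lam = [w - step * g for w, g in zip(lam, grad)]
    return lam
```

The defaults mirror the parameter values quoted above; in practice the step size has to be matched to the scale of the log-likelihood scores.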

We compared the proposed product HMMs without re-estimation
(Product-HMM(W/O Re-est.)), the proposed product HMMs with
re-estimation (Product-HMM(W Re-est.)), the proposed pseudo-biphone
product HMMs without re-estimation (Pseudo-Biphon(W/O
Re-est.)), the proposed pseudo-biphone product HMMs with
re-estimation (Pseudo-Biphon(W Re-est.)), and the GMM for GPD-based
stream weight optimization for acoustic SNR=15, 0, and -5dB.
White noise was added to reduce the acoustic SNR in this experiment.
The audio HMMs were trained using the SNR=15dB data.
The results indicate that the re-estimation of the product HMMs is
quite effective in improving the performance. The re-estimation is
able to introduce the loose state synchronicity of the states of the two
modalities into the product HMMs. The state synchronous modeling
beyond the phoneme boundary by a pseudo-biphone product
HMM also results in significant improvements over the product
HMMs. It is also confirmed that the re-estimation further improves
the performance of the pseudo-biphone product HMMs. The figures
show that the optimal stream weights for the maximum performance
vary according to each method and acoustic SNR. The solid arrows
show the results of the simplified GPD-based stream weight
estimation using 25 adaptation words. The proposed GPD-based
simplified stream weight optimization algorithm successfully
estimated stream weights with almost the best performance. In the
SNR=-5dB environment, however, the estimated weight is not the optimal
one. Figure 8 shows the standard deviation of the word accuracy over
various SNRs, numbers of adaptation words, and numbers of candidates
in GPD training. It is confirmed that the standard deviation at
SNR=-5dB is larger than at the other SNRs, and that a smaller number of
adaptation words gives larger standard deviations. At SNR=0dB, our
proposed method attained 16% higher performance than a
product HMM without the synchronicity re-estimation.

Fig. 5. Word Accuracy (SNR=15dB)

Fig. 6. Word Accuracy (SNR=0dB)

Fig. 7. Word Accuracy (SNR=-5dB)

Fig. 8. Standard Deviation of Word Accuracy

5.  CONCLUSION 

This paper proposes a new HMM structure to effectively integrate
audio and visual information in audio-visual (bi-modal) systems.
Our state synchronous modeling of audio-visual information
is based on the product HMM. The proposed model can represent
synchronicity not only within a phoneme but also between
phonemes. Evaluation experiments show that the re-estimation of
the model parameters using audio-visual synchronous data further
improves the product HMMs. In addition, pseudo-biphone HMMs
that introduce two extra asynchronous states are shown to improve
the bimodal speech recognition accuracy. Furthermore, we also
proposed a rapid stream weight optimization based on the GPD
algorithm for noisy bi-modal speech recognition.
Abstract

Improving the accuracy of speech recognition technology by the
addition of visual information is the key approach to multi-modal
ASR research. In this work, we address two important issues,
which are lip tracking and the visual speech feature extraction
algorithm. In order to utilize multi-modal ASR for natural
speech, the visual front end algorithm must extract affine and
lighting condition invariant visual speech features.

This paper focuses on both the lip tracking algorithm using
the Bayesian framework and a novel pixel based visual speech
feature extraction algorithm based on kurtosis measures of the
frequency profile of the local image blocks. We compare the
results of the proposed features with the results of outer lip
contour based affine-invariant visual features, and global 2D DCT
features. Experimental results in this paper are presented for
a visual-only connected digit recognition task for performance
comparison of the visual features.

Keywords: Lip tracking, Visual feature extraction, Kurtosis
measure.

1.  Introduction 

The addition of visual information to audio features improves
speech understanding and offers key advantages in
human-computer interfaces, especially in difficult environments
[1-6]. Improving the existing state-of-the-art automatic
speech recognition (ASR) performance by integrating
the visual information of the speaker’s mouth region is
receiving significant attention from the speech recognition
communities.

Some of the initial difficulties associated with
computer lipreading (visual speech recognition) are the accurate
and consistent extraction of the visual region of interest (ROI)
and lip tracking on the fly, which need to
be robust to a speaker’s ethnic and gender variability, and to
other visual appearances such as glasses, facial hair, various
skin colors, lip colors, and different lip shapes. Another
difficulty is the robust and consistent extraction of visual speech
features.

The development of a successful audio-visual speech
recognition technology capable of adapting itself to changing
environments will support both industrial and military
applications. Audio-visual speech recognition research is a
relatively new and advancing research area. A noise robust
audio-visual speech recognition system will facilitate the use
of computers, increase reliability and worker productivity,
and naturalize communications between humans and computers.
In addition, audio-visual speech recognition technology
can facilitate new commercial applications such as
text-driven audio-visual talking heads, audio-visual
speech-to-speech translation, and speech-to-video conversion for
the hearing impaired.

In our earlier research [1,7], we have implemented
both late integration and early (multi-stream state synchronous)
integration schemes for a controlled audio-visual
data set. For both integration schemes, the experimental results
showed that the addition of visual information improves
the recognition performance. In this paper, the following
objectives will be sought:

1.  Development  of  a  lip  tracking  algorithm,  and 

2.  A  novel  visual  speech  feature  extraction  algorithm 
that  satisfies  the  following  three  criteria: 

i.  Affine  (rotation,  scale,  and  shear)  invariance, 

ii.  Chrominance  space  shift  invariance,  and 

iii.  Chrominance  space  scale  invariance. 

In our proposed visual speech feature extraction method,
criterion (i) is satisfied by affine correction,
criterion (ii) is satisfied by removing the DC component
of the 2D DCT coefficients, and criterion
(iii) is satisfied by the normalized higher order moments of
the DCT coefficients of the lip image blocks.

This work is organized as follows. In Section 2, we
present a Bayesian framework for lip tracking, the parametric
formulation of the Gaussian parameters, and the adaptation of
the parameters on the fly. Section 3 discusses the removal of
affine (rotation, scale, shear) effects from the segmented lip
image. In Section 4, we discuss contour based affine-invariant
features and pixel based normalized 2D DCT features, and
describe a novel visual speech feature extraction algorithm
based on kurtosis measures of the frequency profile of the
local image blocks of the mouth. We present the experimental
setup and the results in Section 5. Section 6 gives the
concluding remarks and the proposed future work.

2. Lip Tracking Using the Bayesian Framework

The basis of the audio-visual speech recognition system is
an efficient lip tracking algorithm. Computational time
constraints imposed by applications such as audio-visual
speech recognition, animated talking head design, etc.,
contribute to the difficulty of the task. Most lip tracking
algorithms build upon an eigenspace based face detector and
an ensemble of feature detectors which are used to extract
pre-specified landmarks such as nostrils and lip corners to
locate the ROI (mouth region) [8,9]. Deformable template
and snake based methods [10,11] have also been used
for this task. All of these techniques have reported good results,
but their accuracy decreases when there are occlusions (profile
view), lighting condition changes, texture changes, and
quick motion. The technique we propose uses color images
with a Bayesian framework for classification, which requires
the estimation of the a priori probabilities and class conditional
density models. The class conditional density and a
priori probability estimation processes are described in the
following sections.

In the lip tracking problem there are two distinct
classes, lip and non-lip. Therefore, in this section, the
two-class classification problem is discussed, because each sample
in the image frame belongs either to the lip class, $w_1$, or to the
non-lip class, $w_2$. The conditional density functions and the a priori
probabilities are estimated using the training data. In practice, this
may require an extensive search to locate the lip and non-lip
regions in the first frame, which will not be discussed
here. The Bayes decision rule determines whether
an observation, $\mathbf{x}$, belongs to $w_1$ or $w_2$. One of the most
commonly utilized probability density functions in practice
is the Gaussian density function, due to its computational
simplicity and because it models a large number of cases in
nature. The Gaussian parameters are estimated parametrically
on the fly using the information from the previous frame,
which leads to an adaptive real time lip tracking and
segmentation algorithm.

2.1. Parametric Formulation of Gaussian Density from Sample Data

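In the parametric formulation of the multivariate Gaussian
density, estimation of the mean vectors and covariance matrices
of the two classes, $w_1$ and $w_2$, is required. Let $N$ be
the number of samples drawn from a class, $w_i$, with respect
to $\mathbf{x}$ in the n-dimensional feature space. Then the general
multivariate Gaussian (normal) density is given by

$$p(\mathbf{x} \mid w_i) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x}-\mu_i)^T \Sigma_i^{-1} (\mathbf{x}-\mu_i)\right\}, \quad i = 1, 2 \qquad (1)$$

where $\mu_i = E[\mathbf{x}]$ is the mean value of the class $w_i$, and $\Sigma_i$ is
the n x n covariance matrix defined as

$$\Sigma_i = E[(\mathbf{x}-\mu_i)(\mathbf{x}-\mu_i)^T] \qquad (2)$$

$|\Sigma_i|$ represents the determinant of $\Sigma_i$ and $E[\cdot]$ is the
expected value of a random variable. The parameters $\mu_i$ and
$\Sigma_i$ can be estimated without bias by the sample mean and
sample covariance matrix as

$$\hat{\mu}_i = \frac{1}{N}\sum_{j=1}^{N} \mathbf{x}_j^{(i)}, \quad i = 1, 2 \qquad (3)$$

$$\hat{\Sigma}_i = \frac{1}{N-1}\sum_{j=1}^{N} (\mathbf{x}_j^{(i)}-\hat{\mu}_i)(\mathbf{x}_j^{(i)}-\hat{\mu}_i)^T, \quad i = 1, 2 \qquad (4)$$

where $\mathbf{x}_j^{(i)}$ is the jth sample vector from the ith class.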
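
As an illustration of Eqs. (1), (3), and (4), the following sketch estimates the sample mean and unbiased sample covariance of one class and evaluates the Gaussian log-density. The function names are hypothetical, and the use of a Cholesky factorization to obtain the determinant and the quadratic form is an implementation choice of this sketch, not something stated in the text.

```python
import math


def mean_and_cov(samples):
    """Sample mean (Eq. 3) and unbiased sample covariance (Eq. 4)."""
    N, n = len(samples), len(samples[0])
    mu = [sum(x[a] for x in samples) / N for a in range(n)]
    cov = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in samples) / (N - 1)
            for b in range(n)] for a in range(n)]
    return mu, cov


def cholesky(A):
    """Lower-triangular L with L L^T = A (A must be positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def gaussian_logpdf(x, mu, cov):
    """Natural log of the multivariate Gaussian density of Eq. (1)."""
    n = len(mu)
    L = cholesky(cov)
    # Forward substitution: solve L y = (x - mu); then y.y is the
    # quadratic form (x - mu)^T cov^{-1} (x - mu).
    y = []
    for i in range(n):
        y.append(((x[i] - mu[i]) - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in y)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)
```

Working in the log domain avoids underflow when the density is evaluated for every pixel of a frame.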
2.1.1. Class Conditional Mixture Density Estimation

Given the data sets for the lip and non-lip classes from the previous
frame, we can form the class conditional mixture density
function in general as follows.

1. Form a 6-dimensional attribute data set for each class
from color and texture measures (R, G, B, Rv, Gv,
Bv) for each pixel location, and cluster it (possibly
into three clusters for lip, tongue, and teeth) using an
unsupervised K-means clustering algorithm.

2. Form the parametric class conditional density models
$P(\mathbf{x} \mid w_L^{(i)})$ using the method described in Section 2.1
for each cluster, where i represents the cluster i.d.

3. Similarly, repeat steps 1 and 2 to form the parametric class
conditional density models $P(\mathbf{x} \mid w_{nL}^{(i)})$ for non-lips
(nL).

4. Form the conditional density mixture models using the
weighted sum of the conditional densities belonging
to the clusters. That is,

$$P(\mathbf{x} \mid w_i) = \sum_{m=1}^{C} c_m P(\mathbf{x} \mid w_i^{(m)}), \quad i = L, nL \qquad (5)$$

where $C$ is the number of clusters for the lip or non-lip
class, and $c_m = n_m / N$ is the mixture weight obtained
by taking the ratio of the number of pixels in
cluster m to the total number of pixels in that class.
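
Step 4 and Eq. (5) can be sketched as below. The component densities are passed in as callables so that the example stays self-contained; in the procedure above they would be the per-cluster Gaussians of Section 2.1. The function names are hypothetical.

```python
def mixture_weights(cluster_sizes):
    """Mixture weights c_m = n_m / N from the pixel counts of each cluster."""
    total = sum(cluster_sizes)
    return [n / total for n in cluster_sizes]


def mixture_density(x, component_densities, cluster_sizes):
    """Class conditional mixture density of Eq. (5).

    component_densities : one callable per cluster, each returning P(x | cluster)
    cluster_sizes       : number of pixels assigned to each cluster
    """
    weights = mixture_weights(cluster_sizes)
    return sum(c * p(x) for c, p in zip(weights, component_densities))
```

Because the weights are ratios of pixel counts, they sum to one, so the mixture is a valid density whenever each component is.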

2.1.2.  A  Priori  Probability  Estimation 

As shown in Equation 10, a priori probability specification is
an important task for a Bayesian classifier, since the threshold
value of the likelihood ratio is based on the a priori class
probabilities. Basically, it is desired to obtain a speaker and
time (frame) dependent Bayesian parameter set that adapts to
skin tone color variations and lighting variations on the fly.
The selection of the sample data for obtaining class mean
vectors and covariance matrices has a direct effect on the
parametric representation of the class conditional density
models. Calculating the a priori class probabilities based
on the number of pixels in each class data set is biased toward the
sample data, so it would be a poor choice. By careful
examination of the multivariate Gaussian density function in
Equation 1, one intuitive choice of the a priori class
probabilities would be to bias them toward the determinants of the
covariance matrices of the classes, as

p(w)  =  iMTii&r  •=*■•“■  (6) 

where $p(w_1) + p(w_2) = 1$. Figure 1 shows the class regions
based on the threshold value of the likelihood ratio (Bayes
decision rule) and the effect of the a priori class probability
selection.

2.2.  Bayesian  Decision  Rule 

Let $\mathbf{x}$ be an observation vector (a set of features belonging to
a pixel location in the image frame). Our goal is to design
a Bayes classifier to determine whether $\mathbf{x}$ belongs to $w_1$ or
$w_2$. The Bayes test using a posteriori probabilities may be
written as follows:

$$p(w_1 \mid \mathbf{x}) \underset{w_2}{\overset{w_1}{\gtrless}} p(w_2 \mid \mathbf{x}), \qquad (7)$$

where $p(w_i \mid \mathbf{x})$ is the a posteriori probability of $w_i$ given $\mathbf{x}$.
Equation 7 shows that, if the probability of $w_1$ given $\mathbf{x}$ is
larger than the probability of $w_2$, then $\mathbf{x}$ is declared as
belonging to $w_1$, and vice versa. Since direct calculation of
$p(w_i \mid \mathbf{x})$ is not practical, we can re-write the a posteriori
probability of $w_i$ using the Bayes theorem in terms
of the a priori probability and the conditional density function
$p(\mathbf{x} \mid w_i)$, as

$$p(w_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid w_i)\,p(w_i)}{p(\mathbf{x})} \qquad (8)$$

where $p(\mathbf{x})$ is the mixture density function, and is positive
and constant for all classes. Then, the decision rule shown
in Equation 7 can be written as

$$p(\mathbf{x} \mid w_1)\,p(w_1) \underset{w_2}{\overset{w_1}{\gtrless}} p(\mathbf{x} \mid w_2)\,p(w_2) \qquad (9)$$

or, re-arranging both sides, we get

$$L(\mathbf{x}) = \frac{p(\mathbf{x} \mid w_1)}{p(\mathbf{x} \mid w_2)} \underset{w_2}{\overset{w_1}{\gtrless}} \frac{p(w_2)}{p(w_1)} \qquad (10)$$

where $L(\mathbf{x})$ is called the likelihood ratio, and $p(w_2)/p(w_1)$ is
called the threshold value of the likelihood ratio for the decision.
As shown in Equation 10, a priori probability specification
is an important task for a Bayesian classifier. Because
of the exponential form of the densities involved in Equation 10,
it is preferable to work with the following monotonic
discriminant functions, obtained by taking the logarithm of both
sides of Equation 9:

$$g_i(\mathbf{x}) = \ln\left(p(\mathbf{x} \mid w_i)\,p(w_i)\right), \text{ or} \qquad (11)$$

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mu_i)^T \Sigma_i^{-1} (\mathbf{x}-\mu_i) + \ln p(w_i) + a_i \qquad (12)$$

where $a_i = -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i|$ does not depend on $\mathbf{x}$. In
general, Equation 12 has a nonlinear quadratic form. Using
Equation 12, the Bayes rule is as follows, which is preferable
for computational efficiency:

$$g_1(\mathbf{x}) \underset{w_2}{\overset{w_1}{\gtrless}} g_2(\mathbf{x}). \qquad (13)$$

Figure 1: Bayes decision rule and the effect of the a priori class
probability values.


2.3.  Lip  Tracking  Algorithm  and  ROI  Selection 

The Bayesian framework described in this paper utilizes
color images with no prior labeling. The goal is to segment
the lip region in the current frame and select the ROI for
the following frame to limit the search space. The basic lip
tracking and ROI selection procedures are described below.


•  Obtain $g_1(\mathbf{x})$ and $g_2(\mathbf{x})$ using Equation 11 for every
pixel in the image.

•  Use an averaging filter on $g_1(\mathbf{x})$ and $g_2(\mathbf{x})$ to obtain
the smoothed maps $\bar{g}_1(\mathbf{x})$ and $\bar{g}_2(\mathbf{x})$. The smoothing operation
reduces the noise effect.

•  Apply the Bayesian classification rule to every pixel in
the image frame to obtain binary lip candidate pixels,
as

$$\bar{g}_1(\mathbf{x}) \underset{w_2}{\overset{w_1}{\gtrless}} \bar{g}_2(\mathbf{x}). \qquad (14)$$

•  Segment the lip region (using heuristics such as the
largest region between nostrils and chin) in the binary
image resulting from the Bayes classifier.
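
The smoothing and classification steps listed above can be sketched as follows, taking the two per-pixel discriminant maps as given. The function names, the square averaging window, and the truncation of the window at the image border are assumptions of this sketch; the text only states that an averaging filter is used.

```python
def box_filter(grid, radius=1):
    """Average each value over a (2*radius+1)^2 window, truncated at borders."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [grid[rr][cc]
                      for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                      for cc in range(max(0, c - radius), min(cols, c + radius + 1))]
            out[r][c] = sum(window) / len(window)
    return out


def lip_mask(g_lip, g_nonlip, radius=1):
    """Binary lip candidate map: 1 where the smoothed lip discriminant wins (Eq. 14)."""
    s_lip = box_filter(g_lip, radius)
    s_non = box_filter(g_nonlip, radius)
    return [[1 if a > b else 0 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(s_lip, s_non)]
```

With `radius=0` the filter is the identity, which makes it easy to compare the classification with and without smoothing; an isolated high value in the lip map is suppressed once the window is widened.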

The Bayesian classifier is applied to the full image array
for the first frame. Once the lip region is detected in
the current frame, the next frame’s search space is bounded
by a rectangular ROI, obtained by enlarging the current lip
region by 25% of its width and height in the horizontal and
vertical directions, respectively. Thus, the Bayesian classifier is
applied only to the ROI in the next frame, instead of searching the
full image array, which enables real time lip tracking.
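
A minimal sketch of the ROI selection is shown below. The text does not say whether the 25% enlargement is applied per side or in total; this sketch assumes that each side of the lip bounding box is padded by 25% of the corresponding dimension and that the result is clipped to the image. The function name and the box convention are hypothetical.

```python
def next_roi(box, image_size, margin=0.25):
    """Search region for the next frame from the current lip bounding box.

    box        : (x0, y0, x1, y1) with x1 and y1 exclusive
    image_size : (width, height) of the frame
    Assumption: every side is padded by margin * (box width or height).
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    pad_x = int(round(margin * (x1 - x0)))
    pad_y = int(round(margin * (y1 - y0)))
    return (max(0, x0 - pad_x), max(0, y0 - pad_y),
            min(width, x1 + pad_x), min(height, y1 + pad_y))
```

For example, an 80x40 lip box in a 320x240 frame grows by 20 pixels on the left and right and by 10 pixels on the top and bottom.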

Adapting the classifier parameters on the fly makes the
algorithm more robust to lighting changes between frames. Also,
the initial color information extracted from the first image
frame may have several problems with changing conditions.
Firstly, the color features obtained for a person by a camera
are influenced by the ambient lighting conditions and the orientation
of the speaker’s face during speech. Secondly, different
cameras produce significantly different color features even
for the same person under the same lighting conditions. Our
work aims to overcome this difficulty by adapting the classifier
parameters on the fly using the information from the
previous frame. The procedure is as follows:

•  Extract the color features for the lip class.

•  Extract the color features for the non-lip class.

•  Update the classifier parameters using the data obtained
from the above two steps.

3. Removing Affine Parameters from Lip Image

In the audio-visual speech and speaker recognition task,
both contour based and pixel based visual features need to
be independent of the affine (rotation, scale, shear and
translation) parameters. In order to utilize the audio-visual
speech and speaker recognizer for natural speech, the lip
image for every frame needs to be pre-processed to remove
the affine parameters before the visual feature extraction
process described in the following sections is applied.
A question can then be posed as to whether the affine (rotation,
scale, shear and translation) parameters convey linguistic
information that could be utilized for the recognition task.

3.1.  Lip-Rotation  Problem 

Lip-rotation correction on the fly for natural speaker movement
is essential for robust audio-visual speech and speaker
recognition. Utilizing lip corners or some other facial features
such as nostrils and eye corners may be problematic
for rotation correction due to the complexity of locating
such facial features accurately during natural speech [9,12].
We propose a principal component analysis (PCA) based
rotation estimation and correction method to overcome the
difficulties mentioned above.

Figure 2: Lip rotation correction: a) rotation correction using
the PCA, b) outer lip contour after rotation correction, c) gray
lip image after rotation correction and scaling to 96x64 pixels.

3.1.1. Rotation Correction Using PCA

Principal component analysis (PCA) is a method for analyzing
multivariate data to identify a set of new orthogonal axes
known as principal components. The first principal component
is the axis that describes the most variance of the data, the
second principal component is the orthogonal axis that describes
the second most variance of the data, and so on. PCA
is also called the Hotelling transform or Karhunen-Loeve
expansion [13].

Let $\mathbf{x} = [x_1\; x_2]^T$ be a 2-dimensional random variable
with mean $\mathbf{m}_x$ and covariance matrix $C$ based on $N$ samples
of lip image pixel locations. The mathematical representation
of PCA is as follows.

$$m_{x_k} = \frac{1}{N}\sum_{i=1}^{N} x_{ki}, \quad k = 1, 2 \quad \text{so} \qquad (15)$$

$$\mathbf{m}_x = [m_{x_1}\; m_{x_2}]^T \quad \text{and} \qquad (16)$$

$$C = \frac{1}{N-1}\sum_{i=1}^{N} (\mathbf{x}_i - \mathbf{m}_x)(\mathbf{x}_i - \mathbf{m}_x)^T, \qquad (17)$$

where $T$ represents the transpose operation. The task is to
find the new set of orthogonal axes and estimate the rotation
angle with respect to the standard coordinate system, and then undo
the rotation of the lip pixel coordinate data. Figure 2 shows
the rotation correction using the PCA coordinate rotation.

In order to estimate the rotation angle $\alpha$ between the x-axis
and the u-axis shown in Figure 2a, we solve for the eigenvalues
$\{\lambda_1, \lambda_2\}$ of the covariance matrix $C$ and find the eigenvector
$\mathbf{e}_1$ corresponding to the largest eigenvalue. The process is as
follows:

$$|C - \lambda I| = 0, \qquad (18)$$

and then find the eigenvectors (also called proper vectors or
characteristic vectors), calculated as

$$C\,\mathbf{e}_i = \lambda_i\,\mathbf{e}_i, \quad i = 1, 2 \qquad (19)$$

where $\mathbf{e}_i = [e_{xi}\; e_{yi}]^T$. The eigenvector belonging to the largest
eigenvalue defines the rotation angle $\alpha$, as

$$\alpha = \operatorname{atan}(e_{y1} / e_{x1}). \qquad (20)$$

Then the rotation correction matrix $R^{-1}$ can be written as

$$R^{-1} = \begin{bmatrix} \cos(\alpha) & -\sin(\alpha) \\ \sin(\alpha) & \cos(\alpha) \end{bmatrix} \qquad (21)$$

The rotation corrected lip image is obtained by multiplying
$R^{-1}$ with the coordinates of the lip pixel locations, as

$$\begin{bmatrix} x'_n \\ y'_n \end{bmatrix} = R^{-1} \begin{bmatrix} x_n \\ y_n \end{bmatrix}, \quad n = 1, 2, \ldots, N \qquad (22)$$

where $(x_n, y_n)^T$ represents the cartesian coordinates of the
lip pixel locations, and $(x'_n, y'_n)^T$ represents the cartesian
coordinates of the lip pixel locations after the rotation correction.
Figure 2c shows the orientation of the lip shape after
rotation correction and scaling of the lip shown in Figure 2a.

3.2. Scaling Problem

The scaling problem occurs due to the speaker’s distance to the
camera, the camera zoom factor, and the speaker’s actual lip
dimensions. In this case, any pixel based visual feature extraction
method, such as the DCT or wavelet transform method,
which utilizes the frequency content of the lip image may
generate inconsistent (noisy) observation vectors. To overcome
this problem, we propose to interpolate every lip image
to the same size, N x M. Figure 3 shows an example of the scaling
problem for two different speakers and their lip images
after interpolation (scale correction).

Figure 3: An example of the scaling problem due to speaker’s
distance to camera or speaker’s lip physical dimensions.

3.3. Shearing (Uneven Scaling) Problem

Shearing occurs when the speaker’s head position is not
perpendicular to the camera optical axis. For example, one side of
the lips may look larger than the other. Solving the
shearing problem using the information of a single 2D image is
not theoretically possible. There can be various practical
approaches to minimize the shearing effect. For example, using
the symmetry information of the lips may enable us to estimate
the shear matrix by a least squares estimate
and undo the shearing. Figure 4 illustrates a typical
example of a shearing effect in the horizontal direction.

Figure 4: An illustration of the shearing in the horizontal
direction.

The shearing may also be associated with the accent of
a speaker, depending on certain visemes. Then, a similar
question can be posed as to whether shearing conveys linguistic
information.
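
The PCA-based rotation correction of Section 3.1.1 can be sketched as follows for 2-D lip pixel coordinates. The sketch makes several assumptions that are not in the text: it uses a standard right-handed coordinate system, rotates about the centroid of the lip pixels, and undoes the tilt by rotating through $-\alpha$ so that the first principal axis lands on the x-axis; the sign convention of the rotation matrix depends on the axis orientation of the image. Function names are hypothetical.

```python
import math


def principal_angle(points):
    """Centroid and angle of the first principal axis of 2-D points.

    Uses the closed-form eigen-decomposition of the 2x2 sample covariance
    matrix; the angle is folded into (-pi/2, pi/2], matching atan(ey/ex).
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    cyy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Largest eigenvalue of [[cxx, cxy], [cxy, cyy]]
    lam1 = 0.5 * (cxx + cyy) + math.sqrt(0.25 * (cxx - cyy) ** 2 + cxy ** 2)
    # Eigenvector for lam1; it degenerates to (0, 0) only when the
    # principal axis is already the x-axis.
    ex, ey = cxy, lam1 - cxx
    if abs(ex) < 1e-12 and abs(ey) < 1e-12:
        ex, ey = 1.0, 0.0
    alpha = math.atan2(ey, ex)
    if alpha > math.pi / 2:
        alpha -= math.pi
    elif alpha <= -math.pi / 2:
        alpha += math.pi
    return (mx, my), alpha


def correct_rotation(points):
    """Rotate points about their centroid so the principal axis is horizontal."""
    (mx, my), alpha = principal_angle(points)
    c, s = math.cos(alpha), math.sin(alpha)
    return [(mx + c * (x - mx) + s * (y - my),
             my - s * (x - mx) + c * (y - my)) for x, y in points]
```

Using the closed-form 2x2 solution avoids a general eigen-solver, which is sufficient here because the lip pixel locations are two-dimensional.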

4.  Visual  Speech  Feature  Extraction 

Lipreading clearly meets at least two practicable criteria:
it mimics human visual perception of speech,
and it contains information that is not always present in
the acoustic signal [3,4,14-16]. Petajan was one of the first
researchers to build a lipreading system using oral-cavity
features to improve the performance of an acoustic ASR system
[17]. Silsbee et al. [18] utilized vector quantization (VQ)
of acoustic and visual data for their HMM based audio and
video subsystems. Teissier et al. [19] utilized 20 FFT based
1-bark wide channels between 0 and 5 kHz for the acoustic
features, and inner lip horizontal width, inner lip vertical height
and inner lip area for the visual features. Chiou et al. [20]
utilized active contour modeling to extract visual features of
geometric space, the Karhunen-Loeve transform (KLT) to
extract principal components in the color eigenspace, and
HMMs to recognize the combined video only feature sequences.
Potamianos et al. [14,21] used Fourier descriptor
magnitudes for a number of Fourier coefficients, width,
height, area, central moments, and normalized moments as contour
features, as well as image transform features and hierarchical
discriminant features.

In order to utilize audio-visual ASR for natural speech
in varying lighting conditions, the visual front end algorithm
that extracts the visual features must satisfy the three
criteria presented in Section 1. The contour based features
described in Section 4.1 satisfy criterion (i) in the Fourier domain
and are relatively independent of criteria (ii) and (iii).
For pixel based visual feature extraction methods, criterion (i) is
addressed in Section 3. Criteria (ii) and (iii) are addressed for
both the 2D DCT based visual features and the kurtosis measure
based visual features, which are described in Sections 4.2
and 4.3, respectively.

4.1.  AI-FDs  Based  Visual  Features 

In general, for the video feature extraction, the relationship
between the observed parametric outer-lip contour data $x$ and
the parametric reference data $x^{0}$ can be written as

$$x[n] = A\,x^{0}[n + \tau] + b, \tag{23}$$

where $A$ represents an arbitrary 2 x 2 affine matrix,
$\det(A) \neq 0$, that may have scaling, rotation, and shearing
effects, $b$ represents an arbitrary 2 x 1 translation vector,
and $\tau$ is the starting point. These are removed in the
Fourier domain [7,22].

The video feature extraction algorithm extracts twelve
affine-invariant Fourier descriptors (AI-FDs) of the parametric
outer lip contour data, as well as four affine-invariant
oral cavity features, which are width, height, ratio of width
to height, and the outer lip's inner area, obtained by
normalizing the next frame's corresponding oral cavity
features. Dynamic coefficients, which are used as video
observation features, are obtained by differencing the features
of consecutive images in the sequence.
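The differencing step can be illustrated with a minimal sketch. It assumes the static features of each frame are stacked as rows of an array and that the difference is taken as "next frame minus current frame"; the function name `dynamic_coefficients` and the toy data are hypothetical and not part of the original system.

```python
import numpy as np


def dynamic_coefficients(static_features):
    """First-order frame-to-frame differences of a (frames x dims) array."""
    static_features = np.asarray(static_features, dtype=float)
    if static_features.ndim != 2 or static_features.shape[0] < 2:
        raise ValueError("need a 2D array with at least two frames")
    # Row t of the result is features[t + 1] - features[t] (assumed direction).
    return np.diff(static_features, axis=0)


# Toy data: 4 frames with 16 static features each (12 AI-FDs + 4 oral cavity features).
rng = np.random.default_rng(0)
static = rng.normal(size=(4, 16))
delta = dynamic_coefficients(static)   # 3 rows: one per pair of consecutive frames
```

A sequence of T frames yields T - 1 dynamic vectors under this convention; how the first or last frame is padded is a design choice that the text does not specify.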

4.2.  Normalized  2D  DCT  Based  Visual  Features 

The Discrete Cosine Transform is one of many transform
methods that express their input as a linear combination of
weighted basis functions. The 2D DCT of an N x N lip image can
be written as

$$Y = C^{T} X C, \tag{24}$$

where $X$ is an N x N lip image, $Y$ contains the N x N DCT
coefficients, and $C$ is an N x N transform matrix defined as

$$C_{mn} = k_n \cos\left[\frac{(2m+1)\,n\,\pi}{2N}\right], \quad \text{where} \quad k_n = \begin{cases} \sqrt{1/N} & \text{when } n = 0, \\ \sqrt{2/N} & \text{otherwise,} \end{cases} \tag{25}$$

and $m, n = 0, 1, \ldots, N-1$. Our goal is to extract visual
features that satisfy step (ii) and step (iii) and carry the
most relevant information about the lip shape from the N x N
DCT coefficients. Let $I^{0}$ and $I$ be lip shape images which
differ in scale and shift factors (lighting condition), i.e.,

$$I = \alpha I^{0} + \delta, \tag{26}$$

where $\alpha$ and $\delta$ are scale and shift factors in the
acceptable range (footnote 1) of the chrominance/luminance space.

From Equation 25, we know that the zeroth coefficient
of the DCT contains the DC information ($\delta$ in
Equation 26), which does not convey any shape information.
The DCT is also a linear transform, so the scale factor
$\alpha$ simply scales all the DCT coefficients. Normalizing
all the coefficients in the DCT domain by a coefficient
$Y_{mn}$ therefore makes the DCT features scale independent.
Then, 35 coefficients from the lower frequencies are selected,
excluding the DC information. Figure 5 shows the normalized 2D
DCT based visual feature extraction process.
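The scale and shift argument above can be checked numerically with a short sketch. Several details here are assumptions rather than the authors' settings: the reference coefficient used for normalization is taken to be Y(0,1), the 35 low-frequency coefficients are picked in order of increasing u + v, and the test image is synthetic. The names `dct_matrix` and `normalized_dct_features` are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """N x N matrix C of Equation 25 (rows: pixel index m, columns: frequency n)."""
    m = np.arange(n).reshape(-1, 1)
    freq = np.arange(n).reshape(1, -1)
    k = np.full(n, np.sqrt(2.0 / n))
    k[0] = np.sqrt(1.0 / n)
    # Broadcasting multiplies column j by k[j].
    return k * np.cos((2 * m + 1) * freq * np.pi / (2 * n))


def normalized_dct_features(image, num_coeffs=35, ref=(0, 1)):
    """Scale- and shift-invariant subset of the 2D DCT coefficients of a square image."""
    x = np.asarray(image, dtype=float)
    n = x.shape[0]
    if x.ndim != 2 or x.shape[1] != n:
        raise ValueError("expected a square N x N image")
    c = dct_matrix(n)
    y = c.T @ x @ c                      # Equation 24
    if abs(y[ref]) < 1e-12:
        raise ValueError("reference coefficient is too close to zero")
    y = y / y[ref]                       # the scale factor alpha cancels here
    # Assumed selection rule: low frequencies first (by u + v), DC term excluded,
    # so the shift delta never enters the feature vector.
    order = sorted(((u, v) for u in range(n) for v in range(n)),
                   key=lambda uv: (uv[0] + uv[1], uv[0]))
    order = [uv for uv in order if uv != (0, 0)][:num_coeffs]
    return np.array([y[u, v] for u, v in order])


# Synthetic 16 x 16 "lip image": a smooth ramp plus a little texture.
rng = np.random.default_rng(0)
rows, cols = np.meshgrid(np.arange(16), np.arange(16), indexing="ij")
lip = (rows + 1.0) * (cols + 2.0) + rng.random((16, 16))
features = normalized_dct_features(lip)
relit = normalized_dct_features(2.5 * lip + 10.0)   # I = alpha * I0 + delta
```

With these assumptions, `features` and `relit` agree up to rounding error, which is the invariance the normalization is meant to provide. Note that the reference coefficient itself always maps to 1 and so carries no information.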


Figure 5: Normalized 2D DCT based visual feature extraction.
A subset of the 2D DCT coefficients forms the 1 x m observation
vector O, where m is the number of (scale and shift invariant)
coefficients.


4.3. 2D Kurtosis Measure of the Probability Density
Distribution of the DCT Coefficients

After the rotation correction and size normalization of the
lip image, the resulting lip image is divided into 16 x 16
sub-blocks, either with 50% overlap or without overlap, and
the two-dimensional DCT of each block is calculated. For
simplicity, let $Y$ be the matrix of 16 x 16 DCT coefficients.
$Y(0,0)$ depends only on the chrominance/luminance space shift
shown in Equation 26 and conveys no shape information. Thus,
the $Y(0,0)$ coefficient is removed. The remaining coefficients
now depend only on the chrominance space scale (see Equation
26). We remove the dependency on the chrominance space scale by
calculating the 2D kurtosis of the frequency profile
(probability distribution of DCT coefficients) of each block in
the lip image, as discussed in the following sections. Figure 6
shows the pixel based visual front end process, where
$k_0, k_1, \ldots, k_n$ are coefficients for the pixel
(appearance) based visual features of the lip image. In this
work, we refer to these pixel based features as frequency
profile measures (FPMs), which are 2D kurtosis measures of the
probability density distribution of the DCT coefficients.

Figure 6: Illustration of FPM visual feature extraction ($k_i$
is an appearance based visual coefficient for the ith lip image
block).

1 Reference and observed lip image contents are clearly visible
for a range of $\alpha$ and $\delta$.

In the theory of probability, the classical measure of the
non-Gaussianity of a random variable is the kurtosis. Kurtosis
measures the departure of a probability distribution from the
Gaussian (normal) shape (footnote 2). Kurtosis is a
dimensionless ratio, and is greater than zero for most
non-Gaussian random variables (footnote 3). Specifically, for a
given 2D image block function $I(n, m)$, where
$m, n = 0, 1, \ldots, N$, the corresponding 2D DCT coefficients
$Y(x, y)$ can be obtained as described in Section 4.2, where
$x$ and $y$ are the spatial frequencies in the DCT domain. The
high-frequency DCT coefficients (footnote 4) are discarded to
minimize the video noise effect, which is discussed in Section
4.3.1. The remaining lower frequency DCT coefficients
$Y(x, y)$, for $x, y = 1, 2, \ldots, N/2$, are normalized to
form the bi-variate probability density function $p(x, y)$.
Using the notation of [23], for a given univariate random
variable $x$ with marginal probability mass function $p(x)$,
mean $\mu_x$, and finite moments up to the fourth moment, the
univariate kurtosis is defined by:

$$\mathrm{kurt}(x) = \beta_2 = \frac{m_4}{m_2^{2}}, \tag{27}$$

where $m_2$ and $m_4$ are the second and fourth central
moments, respectively. In general, the kth central moment is
defined by:

$$m_k = E[(x - \mu_x)^{k}] = \sum_{x} (x - \mu_x)^{k}\, p(x), \tag{28}$$

where the marginal density function of $x$ is

$$p(x) = \sum_{y} p(x, y), \tag{29}$$

and $E$ denotes the probability expectation [24]. If $x_1$ and
$x_2$ are two independent random variables, then kurtosis has
the following linearity properties:

$$\mathrm{kurt}(x_1 + x_2) = \mathrm{kurt}(x_1) + \mathrm{kurt}(x_2) \quad \text{and} \tag{30}$$

$$\mathrm{kurt}(\alpha x_1) = \alpha^{4}\, \mathrm{kurt}(x_1), \tag{31}$$

where $\alpha$ is an arbitrary scalar. Clearly, any scale factor
in Equation 27 cancels out.

2 The smaller the kurtosis, the flatter the top of the distribution.

3 Kurtosis is 3 for any univariate Gaussian distribution.

4 The upper half of the DCT coefficients are discarded.

Figure 7: In search of the lip region type with 96x64 pixel size
to extract visual speech features: a) exact lip region, b) exact
rectangular lip region, c) extended rectangular lip region.

Let $W$ be a p-dimensional random vector with finite moments up
to the fourth, and let $\mu$ and $\Gamma$ be the mean vector and
covariance matrix of $W$, respectively. Mardia [25] proposed the
p-dimensional multivariate kurtosis as:

$$\beta_{2,p} = E\{[(W - \mu)^{T}\, \Gamma^{-1}\, (W - \mu)]^{2}\}, \tag{32}$$

where $T$ denotes the transpose of a vector. Zhang [23] used the
2D kurtosis of random vectors as a sharpness measure for
Scanning Electron Microscopy (SEM) images. The 2D kurtosis
$\beta_{2,2}$ is calculated by

$$\beta_{2,2} = \frac{\gamma_{4,0} + \gamma_{0,4} + 2\gamma_{2,2} + 4\rho\,(\rho\,\gamma_{2,2} - \gamma_{1,3} - \gamma_{3,1})}{(1 - \rho^{2})^{2}}, \tag{33}$$

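where

$$\gamma_{k,l} = \frac{\sum_{x}\sum_{y} (x - \mu_x)^{k} (y - \mu_y)^{l}\, p(x, y)}{\left[\sum_{x} (x - \mu_x)^{2}\, p(x)\right]^{k/2} \left[\sum_{y} (y - \mu_y)^{2}\, p(y)\right]^{l/2}}, \tag{34}$$

$$\sigma_{xy} = E[(x - \mu_x)(y - \mu_y)], \quad \sigma_x^{2} = E[(x - \mu_x)^{2}], \tag{35}$$

and

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}. \tag{36}$$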

The 2D kurtosis measure, $\beta_{2,2}$, is dimensionless and
scale and shift invariant, as seen in Equation 33. In this work,
the 2D kurtosis defined in Equation 33 is calculated using
the probability density distribution of the DCT coefficients
of the image block function $I(n, m)$. We refer to the
$\beta_{2,2}$ measure as the frequency profile measure (FPM) of
an image block. Image blocks that have zero marginal variance in
$x$ or $y$ are excluded from the $\beta_{2,2}$ calculation, and
their FPMs are arbitrarily assigned the $\gamma_{4,0}$ value
when $\sigma_x \neq 0$ and $\sigma_y = 0$, the $\gamma_{0,4}$
value when $\sigma_y \neq 0$ and $\sigma_x = 0$, and -1 when
both $\sigma_y = 0$ and $\sigma_x = 0$.
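A minimal sketch of the FPM computation in Equations 33 to 36 follows. It rests on several assumptions that the text does not settle: the frequency profile is formed from the absolute values of the DCT coefficients with 0-based frequency indices 1 to N/2 (so the DC row and column are excluded), blocks are 16 x 16 with a stride of 8 pixels for 50% overlap, and a small tolerance stands in for "zero variance". The names `dct2`, `fpm`, and `split_blocks` are hypothetical.

```python
import numpy as np


def dct2(block):
    """Orthonormal 2D DCT, Y = C^T X C, as in Section 4.2."""
    n = block.shape[0]
    m = np.arange(n).reshape(-1, 1)
    freq = np.arange(n).reshape(1, -1)
    k = np.full(n, np.sqrt(2.0 / n))
    k[0] = np.sqrt(1.0 / n)
    c = k * np.cos((2 * m + 1) * freq * np.pi / (2 * n))
    return c.T @ block @ c


def fpm(block, eps=1e-12):
    """2D kurtosis (Equation 33) of the low-frequency DCT profile of a square block."""
    y = dct2(np.asarray(block, dtype=float))
    half = y.shape[0] // 2
    w = np.abs(y[1:half + 1, 1:half + 1])      # assumed: |Y(x, y)|, x, y = 1..N/2
    total = w.sum()
    if total < eps:
        return -1.0                            # no energy outside the DC term
    p = w / total                              # bi-variate probability mass p(x, y)
    idx = np.arange(1, half + 1, dtype=float)
    px, py = p.sum(axis=1), p.sum(axis=0)      # marginals (Equation 29)
    dx = idx - (idx * px).sum()
    dy = idx - (idx * py).sum()
    var_x = (dx ** 2 * px).sum()
    var_y = (dy ** 2 * py).sum()
    if var_x < eps and var_y < eps:
        return -1.0
    if var_y < eps:
        return (dx ** 4 * px).sum() / var_x ** 2   # gamma_{4,0}
    if var_x < eps:
        return (dy ** 4 * py).sum() / var_y ** 2   # gamma_{0,4}

    def gamma(k, l):                           # Equation 34
        moment = (dx[:, None] ** k * dy[None, :] ** l * p).sum()
        return moment / (var_x ** (k / 2) * var_y ** (l / 2))

    rho = (dx[:, None] * dy[None, :] * p).sum() / np.sqrt(var_x * var_y)
    # The case |rho| = 1 is not covered by the text and is not handled here.
    num = (gamma(4, 0) + gamma(0, 4) + 2 * gamma(2, 2)
           + 4 * rho * (rho * gamma(2, 2) - gamma(1, 3) - gamma(3, 1)))
    return num / (1 - rho ** 2) ** 2


def split_blocks(image, size=16, overlap=True):
    """Cut an image into size x size blocks, with 50% overlap if requested."""
    step = size // 2 if overlap else size
    h, w = image.shape
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, step)
            for c in range(0, w - size + 1, step)]


# Synthetic 64 x 64 image standing in for a normalized lip image.
rng = np.random.default_rng(0)
lip = rng.random((64, 64))
fpms = [fpm(b) for b in split_blocks(lip, overlap=True)]
```

Because the DC term is excluded and the profile is normalized to sum to one, replacing each block `b` by `alpha * b + delta` leaves the FPM unchanged up to rounding, which mirrors the invariance claimed for Equation 33.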

4.3.1. Reducing the Effect of Video Noise in FPM Visual Features

It is known that the low-frequency coefficients in the DCT
of the video signal contain the large details and the
high-frequency coefficients contain the finer details of the
image. Video noise (footnote 5) is clearly represented in the
DCT coefficients, and using the full spectrum of the image
leads to noisy (distorted) visual features. That is why some of
the high-frequency DCT coefficients were discarded in the
calculation of the FPM of the image blocks described in Section
4.3. The pixel based visual front end requires further
investigation into how to minimize the effects of video noise
and into the dependence of the FPM on the selection of the
cut-off frequency.

5 Motion blur, coding artifacts, quantization errors, electronic
noise, etc., are considered to be video noise.


Figure 8: In search of the lip region type with 80x48 pixel size
to extract visual speech features: a) exact lip region, b) exact
rectangular lip region, c) extended rectangular lip region.


Figure 9: Effect of interpolating on pixel based visual feature
extraction: a) re-interpolated from 96x64 pixels to 60x60 pixels,
b) re-interpolated from 80x48 pixels to 60x60 pixels.


Table 1: Visual-only recognition accuracy for the connected digit
task using the subset of the normalized 2D DCT features, FPM
features, and concatenated AI-FDs and FPM features. (LR:
lip region, R-LR: rectangular LR, ER-LR: extended R-LR, bl.:
blocks).

Feature set                              TR (%)   TS (%)
---------------------------------------------------------
Subset of normalized 2D DCT using
  exact LR with init. 80x48 pixels        22.40    21.60
  exact LR with init. 96x64 pixels        23.00    20.80
  R-LR with init. 80x48 pixels            24.60    17.20
  R-LR with init. 96x64 pixels            24.00    19.60
  ER-LR with init. 80x48 pixels           22.80    24.40
  ER-LR with init. 96x64 pixels           21.60    21.60
FPMs using
  exact LR with overlapping bl.           41.80    19.60
  exact LR with non-overlapping bl.       35.00    24.00
  R-LR with overlapping bl.               38.80    23.60
  R-LR with non-overlapping bl.           34.60    22.00
  ER-LR with overlapping bl.              39.00    22.00
  ER-LR with non-overlapping bl.          34.20    19.60
Concatenated AI-FDs and FPMs using
  only AI-FDs                             18.55    21.33
  exact LR with overlapping bl.           19.20    18.40
  exact LR with non-overlapping bl.       17.60    18.40
  R-LR with overlapping bl.               18.40    20.40
  R-LR with non-overlapping bl.           17.40    18.40
  ER-LR with overlapping bl.              18.40    17.60
  ER-LR with non-overlapping bl.          17.80    18.80

5. Visual-Only Experimental Setup and Results

This section describes the setup and results of the visual
modality speech recognition (lipreading) system. The aim of
this work is to investigate a visual feature extraction method
that is invariant to affine transformations and lighting
conditions. Therefore, the HMM model structure was kept basic.
The HMM states were modeled with continuous density Gaussians
with single mixture components. The HMM implementation was word
level and left-to-right with no skip transitions, with ten
states (eight emitting and two non-emitting) and diagonal
covariance Gaussian mixture components, since we assume that
the coefficients in the observation vectors are naturally
independent. All the model parameters were initialized using
the Viterbi training algorithm and re-estimated using the
Baum-Welch re-estimation algorithm. The Viterbi (dynamic
programming) algorithm is used for recognition.
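The word-model topology described above can be sketched as a transition matrix. This is only an illustration under stated assumptions: the non-emitting states are taken to be a dedicated entry state and exit state (a convention used by some HMM toolkits), and the initial self-loop probability of 0.6 is a placeholder that training would re-estimate. The function name `left_to_right_transitions` is hypothetical.

```python
import numpy as np


def left_to_right_transitions(num_states=10, self_loop=0.6):
    """Initial transition matrix for a left-to-right HMM with no skip transitions.

    State 0 is a non-emitting entry state and the last state is a non-emitting
    exit state (assumed convention); the states in between are emitting.
    """
    if num_states < 3:
        raise ValueError("need at least one emitting state")
    if not 0.0 < self_loop < 1.0:
        raise ValueError("self_loop must lie strictly between 0 and 1")
    a = np.zeros((num_states, num_states))
    a[0, 1] = 1.0                          # entry state always moves to the first emitting state
    for i in range(1, num_states - 1):
        a[i, i] = self_loop                # stay in the same state
        a[i, i + 1] = 1.0 - self_loop      # or advance by exactly one state (no skips)
    # The exit state has no outgoing transitions, so its row stays zero.
    return a


transitions = left_to_right_transitions()  # ten states: eight emitting, two non-emitting
```

Since zero entries stay zero under Baum-Welch re-estimation, the left-to-right, no-skip structure fixed here is preserved during training.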

The Clemson University Audio-Visual Experimental
(CUAVE) connected and continuous audio-visual digit
database, which is a thirty-six subject dataset, was utilized
for the experiment. The visual-only experimental results
are presented for a connected audio-visual digit recognition
task. The following visual features, extracted from the exact
lip region, the exact rectangular lip region, and the extended
rectangular lip region as shown in Figures 8 and 9, are
utilized in the visual-only speech recognition system.

1.  Subset  of  normalized  2D  DCT  features 

2.  FPM  features 

3. AI-FD features

4.  Concatenated  AI-FDs  and  FPM  features 

A subset of the 36-speaker dataset was used, containing 15
speakers, each uttering the digits 0-9 five times. The speakers
are split into training (TR) and testing (TS) sets of ten and
five subjects, respectively, leading to a speaker independent
visual-only recognition system. The results are shown in Table 1.

6.  Concluding  Remarks  and  Future  Work 

Table 1 shows the visual-only connected digit recognition
results, where TR corresponds to training set performance
and TS corresponds to test set performance, for the various
visual features discussed in this paper. For the subset of the
normalized 2D DCT features, the training set results from the
exact rectangular lip region are better than those from the
exact lip region and the extended lip region (see Figure 9).
Another observation is that the slight change in lip image
content due to the linear interpolation affects the system's
performance.

In the results obtained using FPM features, the training
set performance is much better than the test set performance.
Similar behavior was observed for a speaker dependent
recognition task. Therefore, we conclude that FPM based
features are highly sensitive to video noise. The overlapping
block based FPM features outperformed the non-overlapping
block based FPM features significantly on the training set.
Among the three different lip regions shown in Figure 9, the
exact lip region with overlapping blocks outperforms the other
two regions.

In the results obtained using concatenated AI-FDs and
FPMs, the training set and test set performances are close
to each other and worse than the FPMs-only results. Therefore,
we conclude that each feature should be treated as a separate
stream and weighted properly so that each contributes
additional information to the other. Similarly, a slight
performance increase of the overlapping block based FPM
features over the non-overlapping block based FPM features can
be noticed.

We also report that the number of mixtures selected for the
Gaussian mixture model (GMM) and the number of states in the
silence model affect the performance of the visual-only system.
For example, setting the number of GMM mixtures to twelve and
using embedded training, the FPM based visual-only system
achieved 98% recognition accuracy on the training set, but
about 16% on the speaker independent test set (which is less
than the single-mixture result reported in Table 1). Similar
behavior is observed for the speaker dependent set. That is,
the system is well trained with the FPM features, but both test
sets behave like an unmatched system due to the resulting noisy
observations.

We conclude that visual noise is an important factor in
visual speech feature extraction, and that overlapping local
image block based FPM features outperform normalized 2D
DCT features, AI-FD features, and concatenated AI-FDs
and FPM features. Future work will include initial lip
segmentation for the Bayesian framework training and further
study of noise robust FPM feature extraction.

Abstract

Automatic lip-reading has attracted attention as a complementary
method of automatic speech recognition in noisy environments.
One of the most competitive lip-reading algorithms is the image
transform based lip-reading (ITLR) algorithm. However, ITLR
suffers severe performance degradation under illumination
variations.

RASTA is a kind of inter-frame filtering method. It is used for
rejecting stationary and convolutional noise in speech signal
processing. In this paper, we apply the RASTA approach to ITLR
and analyze the performance of this method. We propose two
merging techniques: pre-integration (PRE-I) and
post-integration (POST-I). In PRE-I RASTA, inter-frame filtering
is performed ahead of the image transform process. In POST-I,
inter-frame filtering is done after the image transform process.
We also compare the effectiveness of high-pass filtering and
band-pass filtering as inter-frame filtering.

Experimental results show that pre-integration is very effective
at rejecting illumination variations. It is also observed that
high-pass filtering is sufficient to enhance the performance of
lip-reading.

1.  Introduction 

Recently, research on automatic lip-reading using the video
sequence of the speaker's mouth has attracted significant
interest. In noisy environments, automatic lip-reading is very
effective in compensating for the decrease in recognition rate
of an audio-only automatic speech recognition (ASR) system [1].
The bimodal approach based on audio-visual information is an
important part of the human-computer interface (HCI). More
weight is given to the visual data than to the audio data under
a poor SNR and, conversely, more to the audio data than to the
visual data under a clean SNR [2]. Under noisy circumstances,
this bimodal approach has been a good alternative, showing a
superior recognition rate to an audio-only ASR system.

In this paper, we concentrate on the image transform based
approach to automatic lip-reading (ALR) for a bimodal speech
recognition system. This approach is known to be superior to a
lip-contour-based method for visual-only HMM recognition tasks.
However, while the lip-contour-based approach needs only a few
visual parameters, for example, the outer and inner lip contours
and the lip width, the image-transform-based approach requires
much larger visual feature vectors, since it is based on the
whole transformed image data of the speaker's mouth. Thus, for
a fast algorithm, the size of these data needs to be reduced.

To reduce the dimensionality of feature vectors, principal
component analysis (PCA), which is based on linearly projecting
the image space onto a low dimensional feature space, has been
suggested as a good method [3]. ITLR, however, has a robustness
problem. Under varying illumination, recognition performance on
the observed image sequences degrades rapidly. Illumination
variation caused by the inconsistency between training and test
conditions interferes with the recognition process, for example
with exact feature extraction. This interference causes a
mismatch between the correct word and the related feature model
and, in the end, reduces the recognition rate. Our preliminary
experiment with a lip-reading system showed that even a small
amount of intensity variation caused a large degradation of
lip-reading performance [4].

To tackle these problems, we propose an inter-frame filtering
method, which is very similar to RASTA filtering in automatic
speech recognition (ASR). According to reference [5], RASTA
filtering is very successful in ASR in environments with
convolutional noise. We propose two kinds of integration
methods, pre-integration and post-integration. We examine the
usefulness of the inter-frame approach with our own lip-reading
system.

In Section 2, we briefly describe the algorithm for a real-time
automatic visual-only lip-reading system and discuss the
necessity of the proposed method. Section 3 describes methods
to diminish the illumination noise for an improved recognition
rate. Finally, Section 4 presents experimental results.

2. Baseline system: visual-only HMM-based lip-reading system

To develop a robust lip-reading algorithm, we implemented an
automatic image transform based lip-reading system using HMM
based word models. Figure 1 shows the overall block diagram of
the implemented system based on the proposed algorithm. Given
an image sequence containing the speaker's mouth, the overall
process to extract the visual feature data consists of two
sub-processes. One is the ROI (region of interest) extraction
process and the other is the feature parameter extraction
process.

2.1 ROI extraction

Since lip-reading is based on the visual information of the
moving lips, extraction of an appropriate region of interest
containing only the moving lip area is important. ROI
extraction from each image frame of the given sequence is
performed before feature extraction. As shown in Figure 1, the
ROI extraction process consists of three steps: 1) gray-level
transformation, 2) masking filtering, and 3) binary-level
transformation.

Figure 1. Block diagram of the proposed method for the real-time
visual-only HMM based lip-reading system.

To find the lip area efficiently, the color image is first
transformed into a gray-level image and then into a binary-level
image. Both lip ends of the moving lips are extracted from this
binary-level image by applying the Y-projection and then the
X-projection. The vertical and horizontal center of the
speaker's mouth is obtained from these X and Y projections.
Then, the square pixel window of the ROI is constructed around
the speaker's mouth. Since the lip width information of the
moving lips is important, we keep the width of the ROI obtained
at the first frame of each word through to the last frame of
that word. During the ROI extraction process, a 'masking filter'
is applied to diminish the unbalanced illumination of the facial
area caused by various lighting sources.
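The projection-based ROI search can be sketched as follows. This is a simplified illustration under several assumptions: lip pixels are taken to be darker than a fixed, hypothetical threshold of 128, the masking filter is omitted, and the square window is simply clipped to the image borders. The function names `to_binary` and `extract_roi` are not from the original system.

```python
import numpy as np


def to_binary(gray, threshold=128):
    """Binary-level transformation; assumes lip pixels are darker than the threshold."""
    return (np.asarray(gray) < threshold).astype(np.uint8)


def extract_roi(binary, width=None):
    """Return (top, left, width) of a square ROI centered on the mouth.

    Pass the width found at the first frame of a word to keep it fixed
    for the remaining frames of that word.
    """
    row_profile = binary.sum(axis=1)       # projection onto the Y axis
    col_profile = binary.sum(axis=0)       # projection onto the X axis
    ys = np.flatnonzero(row_profile)
    xs = np.flatnonzero(col_profile)
    if ys.size == 0:
        raise ValueError("no lip pixels found")
    left_end, right_end = xs[0], xs[-1]    # the two lip ends
    center_y = (ys[0] + ys[-1]) / 2.0
    center_x = (left_end + right_end) / 2.0
    if width is None:
        width = int(right_end - left_end + 1)
    h, w = binary.shape
    top = int(np.floor(center_y - (width - 1) / 2.0 + 0.5))
    left = int(np.floor(center_x - (width - 1) / 2.0 + 0.5))
    top = max(0, min(top, h - width))      # keep the window inside the image
    left = max(0, min(left, w - width))
    return top, left, width


# Synthetic frame: bright face (200) with a dark mouth patch (50).
frame = np.full((64, 96), 200, dtype=np.uint8)
frame[20:30, 30:60] = 50
mask = to_binary(frame)
top, left, width = extract_roi(mask)
roi = frame[top:top + width, left:left + width]
```

For later frames of the same word, calling `extract_roi(next_mask, width=width)` re-centers the window on the mouth while keeping the width from the first frame, as described above.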


2.2 Feature extraction

To reduce the visual feature parameter size, each ROI is
downsampled into a 16 x 16 pixel window for a fast algorithm.
This operation is necessary not only to reduce the feature data
size but also to normalize the differences in ROI size due to
variations such as speakers' lip widths and distances from the
camera.

To reduce the dimensionality of the visual feature vector, PCA
(principal component analysis) is applied. PCA is known to be
simple to implement and to give good performance in automatic
lip-reading [6]. We also use a lip-folding technique before the
PCA process. Lip-folding is based on the symmetry of the lips
about the vertical axis. Lip-folding reduces the 16 x 16 image
to a half-size 8 x 16 image. The mean half-sized image needs
fewer principal components to represent it than the original
unfolded one. Additionally, the mean image compensates for the
illumination imbalance between the left lip area and the right
lip area and, therefore, shows robustness under various
lighting conditions [7].

2.3 HMM based word recognition

For every video field, a static observation feature vector is
acquired, and the vectors obtained from the given video
sequence are used for HMM based word modeling. Our automatic
lip-reading system uses continuous density HMMs as a means of
statistical pattern matching. The HMM observation probabilities
are modeled as multi-dimensional Gaussian mixtures with diagonal
covariance matrices. For the specific lip-reading recognition
tasks considered in this paper, we use whole word, 3-6 state,
left-to-right models with 3-8 mixtures per state. All HMM
parameters are estimated by maximum likelihood Viterbi training.

3. Inter-frame filtering

One of the problems of ASR is robustness. The performance of
ASR is commonly worse in noisy environments. In general, noise
is classified into additive and convolutional noise. RASTA
filtering is one of the methods used in ASR to prevent the
degradation of ASR performance. RASTA stands for 'relative
spectral' processing. It was found that filtering time
trajectories could compensate greatly for the effect of the
convolutional noise induced by the communication channel [5].
RASTA filtering is performed with a band-pass filter. In RASTA
filtering, slowly varying components, corresponding to the
frequency characteristics of the communication channel, are
suppressed. The low-pass filtering helps to smooth some of the
fast frame-to-frame spectral changes present. The commonly used
band-pass filter is as follows.

$$H(z) = 0.1\, z^{4}\, \frac{2 + z^{-1} - z^{-3} - 2 z^{-4}}{1 - 0.98\, z^{-1}} \tag{1}$$

Based on these results, we discuss how inter-frame filtering is
applied to lip-reading problems to enhance the performance of
automatic lip-reading.

3.1 Integration of inter-frame filtering with lip-reading system

According to the original work of Hermansky, RASTA filtering is
applied to the speech feature vector (SFV) sequence after
obtaining the SFVs. The RASTA filter is a kind of band-pass
filter that rejects slowly and rapidly varying components. In
our lip-reading system, the feature extraction process is PCA
and the feature parameters are the projections of the original
image onto the most important axes. Thus, we can integrate
inter-frame filtering after PCA in our lip-reading system, a
simple imitation of the ASR structure adopting RASTA filtering.
We call this approach post-integration (Post-I). Figure 2 shows
the block diagram of the Post-I method.

On the other hand, our AV database (DB) was recorded under
various lighting conditions, with illumination not regulated
during recording. Thus, we may consider that our AV DB was
affected by illumination noise from the start. If the
illumination noise was variable and dynamic, the result of PCA
may include the influence of illumination noise, so the m most
important axes would contain components induced by illumination
noise. This consideration leads us to change the order of PCA
and inter-frame filtering. Figure 3 shows the second
integration method, pre-integration (Pre-I).


Figure 2. Post-integration method (Post-I).

Figure 3. Pre-integration method (Pre-I).

3.2 Filters for inter-frame filtering

The band-pass filter used in ASR is shown in Eq. (1). This
filter is not well suited to filtering an image sequence,
because the sampling frequency of image capture is very low
compared with that of speech features. For speech signals, 100
feature vectors per second is common, whereas in our case the
sampling frequency of the image signal is 30 Hz. So, we used
very simple IIR filters for inter-frame filtering, as follows.


High-pass filter:

$$Y_t[n, m] = 0.9858 \cdot (X_t[n, m] - X_{t-1}[n, m]) + 0.9716 \cdot Y_{t-1}[n, m] \tag{2}$$

Low-pass filter:

$$Y_t[n, m] = 0.8638 \cdot (X_t[n, m] + X_{t-1}[n, m]) + 0.7257 \cdot Y_{t-1}[n, m] \tag{3}$$


Both filters are IIR(1,1) filters designed using MATLAB.
Figure  4  shows  the  original  image  sequence  and  the  filtered 
image  sequences. 
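The high-pass inter-frame filter can be sketched as a first-order recursion applied independently to every pixel along the time axis, as it would be used in Pre-I before PCA. The coefficients are those of the high-pass filter given above; the initial condition (the frame before the first is taken equal to the first frame, and the initial output is zero) is an assumption, as is the function name `highpass_interframe`.

```python
import numpy as np


def highpass_interframe(frames, b=0.9858, a=0.9716):
    """Apply y_t = b * (x_t - x_{t-1}) + a * y_{t-1} to each pixel over time.

    `frames` has shape (T, H, W). Assumed start-up: x_{-1} = x_0 and y_{-1} = 0,
    so a sequence that never changes produces an all-zero output.
    """
    x = np.asarray(frames, dtype=float)
    y = np.empty_like(x)
    prev_x = x[0]
    prev_y = np.zeros_like(x[0])
    for t in range(x.shape[0]):
        y[t] = b * (x[t] - prev_x) + a * prev_y
        prev_x, prev_y = x[t], y[t]
    return y


# Synthetic sequence: 30 frames of 16 x 16 pixels (one second at 30 frames/sec).
rng = np.random.default_rng(0)
frames = rng.random((30, 16, 16)) * 255.0
filtered = highpass_interframe(frames)
brighter = highpass_interframe(frames + 25.0)   # same sequence under a constant brightness offset
```

Because the filter only sees frame-to-frame differences, a constant additive brightness offset leaves the output unchanged in this sketch. Real illumination changes are not purely additive or constant, so this shows the mechanism rather than the measured benefit.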


Table 1. Experimental environments.

Camera               SONY digital home video camera
Frame rate           30 frames/sec
Words                22 Korean words selected from the command
                     menu for a car navigation system
Training speakers    52 male speakers
Test speakers        18 male speakers different from the
                     training speakers
Recording condition  All recordings were performed in different
                     rooms at different times

Figure 5. Some examples from our recorded database.


4.  Experimental  Environments  and  Results 

4.1  Experimental  environments 

The experimental environment is shown in Table 1. The
database is composed of 22 Korean words spoken by 70 speakers.
Figure 5 shows sample images from the AV database. As shown in
the figure, our database, recorded in different rooms and at
different times, exhibits illumination variations.

4.2  Experimental  results 


Figure 4. Inter-frame image filtering results: (a) original
image sequence (16 x 16); (c) band-pass filtered image sequence
(8 x 16).


In this subsection, we describe the results of the two proposed
integration methods, Pre-I and Post-I, in terms of feature
vector dimension and recognition results. Table 2 shows the
dimension of the features for the Pre-I and Post-I integrations.
From Table 2, it is observed that the pre-integration method is
very effective in reducing the number of principal components.
The reason for this could be that the pre-filtering rejects the
influence of illumination noise before the PCA process.

Table 2. Comparison of feature dimensions in cases of Pre-I and
Post-I.
[Table body not recoverable. Rows: band-pass, high-pass, and no
filter for each integration method; columns: PCA 90%, PCA 95%.]

The other observation is that the low-pass filtering does not
reduce the feature vector dimension. This result is not
surprising, since the sampling rate of the image signal is much
lower than that of the speech signal. In any case, using the
pre-integration, the feature vector dimension is reduced up to
approximately 30%. The recognition results are shown in Figures
6 and 7. From these two figures we can observe the following
facts.

1) Post-integration does not improve lip-reading
performance; it makes it worse. Pre-integration, however,
enhances the recognition rate of the lip-reading system.
This is a point of difference from ASR.

2) Band-pass filtering, and in particular its low-pass part, is
not decisive in increasing the recognition rate. In other
words, high-pass filtering is sufficient for the lip-reading
system. As discussed above, this is because the sampling
rate of the video data is high relative to the rate of lip
movements in speaking.

It is clear that pre-integration of inter-frame filtering is very
effective in automatic lip reading. Pre-integration not only
reduces the dimension of the feature space but also improves the
recognition rate of the image-based lip-reading system.

5.  Concluding  Remarks 

In general, lip-reading performance, especially that of image-
transform-based lip-reading, is very sensitive to illumination
variance. It is therefore necessary to develop a robust version of
lip-reading before automatic lip-reading can be used in real
service environments.

In this paper, we proposed an inter-frame filtering approach as
one robust lip-reading method and analyzed its performance.
Our experimental results showed that pre-integration of
inter-frame filtering enhanced lip-reading performance. The
achievements are as follows.

1) Inter-frame filtering reduced the feature vector dimension.

2) Inter-frame filtering improved the recognition rate of
automatic lip reading.

In future work, we will enlarge our AV database and study
more robust methods so that automatic lip-reading can be used in
real environments.


[Bar chart; horizontal axis: PCA 90%, PCA 95%; legend: Bandpass, Highpass, Nofilter]

Figure 7. Recognition results of pre-integration

Abstract

This paper examines a new robust color scheme
and an adaptive object tracking technique.
Several popular color schemes are used in
face tracking, including Normalized RGB,
Hue, Saturation, and Hybrid types of color.
Hybrid color schemes provide improved results
compared to any single color scheme.
Extensive experiments show that the new robust
Hybrid color scheme produced superior results
in various lighting conditions. To track head
movements with the robust hybrid color scheme,
a supporting algorithm was needed to
approximate the random path of the head
movement. The Kalman filter is a well-known
estimation technique used in many areas to
predict the route of a moving object. We
developed and tested a random-walk Kalman
filter to track unpredictable and fast-moving
objects. The random-walk Kalman filter
tolerates the quick random movements made by a
person, which linear tracking techniques did
not accommodate.

1.  Introduction 

For many computer vision applications, such as
automatic speech recognition, 3D animation, and
surveillance, a robust and reliable automatic head
tracking technique that works in various
unmodified environments is vital. Recent research
in this area shows great progress and promise.
There are many approaches to tracking the head
position in an image sequence. Some tracking
modules are based on invariant features, which are
used to find a structural feature; some are based
on template matching, which uses a stored pattern
(2D or 3D) to track the head position. Others use
appearance-based methods, in which a model trained
on a set of images captures the representative
variability of facial appearance. In this paper we
explore a combination of a hybrid color scheme
module and a random-walk Kalman filter to
track random head movement in a variety of
environments.

Many researchers have exploited the relative
uniqueness of skin color to track faces. Human
skin color has been used and proven to be an
effective feature in many applications. A
weakness of these systems is their heavy reliance
upon skin color, which forbids skin-colored objects
in the background and, more importantly, forbids
the subject from turning around so that the back
of his head, rather than his face, is visible [1].

The color image histogram is an effective tool for
object recognition, segmentation, or tracking.
Color histograms are relatively invariant to many
complicated, non-rigid motions such as
translation, rotation about the imaging axis,
small off-axis rotations, scale changes, and
partial occlusion. Color histogram percentile
features are useful for recognizing the pattern of
a human face with relatively low complexity. Many
methods have been proposed to build a skin color
model. In this paper we propose a new Hybrid color
scheme, supported by additional Hue and
Saturation analysis features, that provides a
noticeable improvement in performance in
various lighting conditions.

The Kalman filter is an optimal estimator for
predicting the next position of a moving object. It
addresses the general problem of estimating
parameters of interest from indirect,
inaccurate, and uncertain measurements.
However, a general-purpose Kalman filter works
well only when movement in the image sequence is
slight and speed changes gradually. We need
adaptive methods to overcome this problem.

Section 2 covers the color performance
analysis in head tracking and shows the improved
result of our new color scheme compared to the
results of other systems. Section 3 covers the
random-walk Kalman filter used to trace the correct
location of an unpredictable and rapidly moving
object. Finally, Section 4 provides the conclusions
drawn from the experimental results.

2.  Analysis  of  Color  Scheme  for  Head 
Tracking 

In the RGB model, a color is expressed in terms
of the amounts of Red, Green, and Blue
light it contains. Normalized color space is a
popular color representation for specifying human
skin color patterns. Since under normal lighting
conditions the brightness of the face is not
important for characterizing skin colors, we can
represent skin color in the chromatic color space.
Chromatic colors, known as "pure" colors in the
absence of brightness, are defined by a
normalization process [2]:

Cr = R / (R + G + B)

Cb = B / (R + G + B)
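
As a minimal sketch of the normalization above, the snippet below computes Cr and Cb per pixel with NumPy. The function name `chromatic_colors` and the handling of pure-black pixels (mapped to 0) are assumptions made for illustration, not part of the original method.

```python
import numpy as np

def chromatic_colors(rgb):
    """Return (Cr, Cb) planes for an array of RGB pixels with shape (..., 3)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    total = rgb.sum(axis=-1)
    # Assumption: pure-black pixels (R+G+B == 0) are mapped to 0 to avoid division by zero.
    safe = np.where(total > 0, total, 1.0)
    cr = np.where(total > 0, rgb[..., 0] / safe, 0.0)
    cb = np.where(total > 0, rgb[..., 2] / safe, 0.0)
    return cr, cb

# The same surface color at two brightness levels (hypothetical values).
pixel_dim = np.array([[[60, 40, 20]]])
pixel_bright = np.array([[[180, 120, 60]]])
cr_dim, cb_dim = chromatic_colors(pixel_dim)
cr_bright, cb_bright = chromatic_colors(pixel_bright)
```

Both pixels map to the same (Cr, Cb) pair, which is the brightness independence the normalization is meant to provide.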


We sought a new color scheme that is robust to
various lighting and background conditions. In
our previous experiment, the Stanford scheme
showed better results than the other methods. In
addition, however, insensitivity to illumination
is required for a practical and dependable
tracking module. We chose a new Hybrid color
scheme that uses additional Hue and Saturation
features to achieve this goal.

The research was carried out on various image
sequences with different lighting conditions,
backgrounds, and persons. For an objective
comparison of results, all four sequences were
obtained from the Vision Lab website of Stanford
University. The person in a sequence is always
kept inside the frame by controlling the camera
movement. These sequences include different
races, lighting conditions, and backgrounds.
Importantly, a linear prediction technique was
used to predict the next head position in this
test.


The most common way of representing color is the
RGB color space, but this color model is quite
sensitive to lighting conditions because the
color attribute is combined with the brightness.
The Hue (color) component can be used for facial
region localization because it is comparatively
insensitive to illumination changes. The Hue
image is obtained by a nonlinear color-space
transform from RGB to HSV. However, a simple Hue
image is easily affected by complex background
texture. An additional Saturation component can
compensate for this lack of robustness in
intricate environments.

S. Birchfield [2] introduced his own color
scheme, which in our experiments we call the
Stanford scheme. It uses a color space
consisting of scaled versions of the three axes
B-G, G-R, and B+G+R. The first two contain the
chrominance information and are sampled into
eight bins each, while the last one contains the
luminance information and is sampled more
coarsely into four bins. The major difference in
his method is that it also considers luminance
information. Using this scheme we obtained
fairly good tracking results. However, the scheme
shows partial dependency on lighting conditions
and background texture.
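
The binning described above can be sketched as follows. This is an illustration under stated assumptions rather than Birchfield's implementation: the three axes are histogrammed separately and concatenated into 8 + 8 + 4 = 20 bins (matching the "20" used for the Stanford scheme in this paper), pixels are assumed to be 8-bit, and histogram intersection is assumed as the matching score. The names `stanford_histogram` and `histogram_intersection` are hypothetical.

```python
import numpy as np

def stanford_histogram(rgb_pixels):
    """Concatenate 8 + 8 + 4 bin histograms of B-G, G-R and B+G+R (8-bit pixels assumed)."""
    p = np.asarray(rgb_pixels, dtype=np.float64).reshape(-1, 3)
    if len(p) == 0:
        raise ValueError("at least one pixel is required")
    r, g, b = p[:, 0], p[:, 1], p[:, 2]
    h_bg, _ = np.histogram(b - g, bins=8, range=(-255, 255))    # chrominance axis 1
    h_gr, _ = np.histogram(g - r, bins=8, range=(-255, 255))    # chrominance axis 2
    h_lum, _ = np.histogram(b + g + r, bins=4, range=(0, 765))  # luminance, coarser
    hist = np.concatenate([h_bg, h_gr, h_lum]).astype(np.float64)
    return hist / len(p)  # each of the three sub-histograms now sums to 1

def histogram_intersection(candidate, model):
    """Assumed similarity score in [0, 1]; 1 means identical histograms."""
    return float(np.minimum(candidate, model).sum() / model.sum())

# Hypothetical 2 x 2 face patch.
patch = np.array([[[200, 150, 120], [190, 140, 110]],
                  [[60, 60, 60], [210, 160, 130]]])
model = stanford_histogram(patch)
self_score = histogram_intersection(model, model)
```

Extra Hue and Saturation bins, as in the Hybrid scheme, could be appended to the same vector in the same way.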


Tables 1 and 2 show the head tracking results of
the color schemes we chose for the test. As
shown below, the Hybrid color histogram with
(20 (Stanford) + 4 (Hue) + 4 (Saturation)) bins
gives the best results compared to Hue (16), Hue
and Saturation (8 + 8), Normalized RGB, the
Stanford scheme (20), and the Hue-hybrid (20 +
8 (Hue)) color histogram.

We used the average distance from the true
center (Table 1) and the average success rate
(Table 2) as performance measures. The true
center of each frame was first marked manually
through the whole sequence. The average distance
was calculated from this series of true center
points. Each test was evaluated in both the X and
Y directions to provide a better benchmark for
the tracking results.


Figure 1 : Manually grabbed facial region

Figure 1 shows the facial area and the center point
of that region. The hit count for each sequence in
Table 2 is incremented when the tracked point
falls inside this rectangular region. An error of
five to ten pixels, depending on the image, is
accepted.
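
The two measures can be sketched as below, treating the X and Y directions separately as the tables do. The function name `tracking_metrics`, the per-axis tolerance arguments, and the sample coordinates are all hypothetical; the paper does not give the exact hit test beyond the description above.

```python
import numpy as np

def tracking_metrics(tracked, truth, tol_x, tol_y):
    """Per-axis average distance from the true center and per-axis hit counts.

    tracked, truth: sequences of (x, y) centers, one per frame.
    tol_x, tol_y: accepted error in pixels along each axis (assumed hit criterion).
    """
    tracked = np.asarray(tracked, dtype=np.float64)
    truth = np.asarray(truth, dtype=np.float64)
    err = np.abs(tracked - truth)               # |dx|, |dy| for every frame
    avg_distance = err.mean(axis=0)             # analogous to a Table 1 entry
    hits = (err <= [tol_x, tol_y]).sum(axis=0)  # analogous to a Table 2 entry
    success_rate = 100.0 * hits / len(truth)
    return avg_distance, hits, success_rate

# Three hypothetical frames.
tracked = [(10, 10), (20, 35), (31, 30)]
truth = [(12, 10), (20, 30), (30, 30)]
avg_distance, hits, success_rate = tracking_metrics(tracked, truth, tol_x=5, tol_y=4)
```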

According to Tables 1 and 2, Hybrid color
(20+4+4) gives an average distance of 5.86 pixels
along the X axis and 8.96 pixels along the Y axis.
This is a fairly good result compared to the two
other competitive color schemes, Hybrid (20+8) and
Stanford (20). The results in Table 2 support
this conclusion.

One might expect a better result from adding only
the Hue feature (20+8). However, this scheme gave
a worse result for sequence 3, where the success
ratio along the Y axis is less than 50%. This
means that Hue information alone is not stable
enough to support the Stanford color scheme
completely.

The Stanford color scheme includes normalized
color and regular RGB color. Even though it
provides comparatively good results, it is still
not robust enough under different conditions. Our
test results show that additional Hue and
Saturation color features can mitigate the
performance limitations of the Stanford color
scheme.

3.  Random-walk  Kalman  Filter 

Robust head tracking requires a reliable
prediction module for estimating the position of
randomly moving objects. Our approach is based on
Stan Birchfield's method [2], which uses
intensity gradients, color histograms, and
simple linear prediction. For the gradient, an
ellipse template is used to calculate the total
gradient value around the ellipse within a
suitable search window, and the position with the
maximum value is taken. For color, a face color
histogram model is created and matched within the
same search window. Birchfield also used linear
prediction to place the search window in the
upcoming frame according to the positions in the
previous 2 frames.
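
A minimal sketch of two-frame linear prediction is shown below, assuming it amounts to constant-velocity extrapolation (the text does not spell out the formula, so this form is an assumption). The function name `predict_linear` and the coordinates are hypothetical.

```python
def predict_linear(prev2, prev1):
    """Extrapolate the next (x, y) center assuming constant velocity over the last two frames."""
    return (2 * prev1[0] - prev2[0], 2 * prev1[1] - prev2[1])

# Steady motion: the prediction continues the same step.
smooth = predict_linear((100, 80), (110, 82))

# Hypothetical reversal: the head jumps +62 px in x, then returns to where it started.
reversal_pred = predict_linear((100, 80), (162, 80))
reversal_error = abs(reversal_pred[0] - 100)
```

In the reversal case the predicted center is off by twice the jump size, which illustrates why a fixed search window around a linear prediction can miss a fast, erratic target.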

The main problem with the Birchfield method is
a lack of accuracy when the head moves too fast
or the frame rate is too low. The result is an
unreliable prediction window, and the tracker
loses the head position. In this case, the way to
improve tracking performance is to increase the
range of the search window; however, this slows
down processing. So there is a limitation in
using the linear prediction algorithm used by
Birchfield.

To overcome this problem, we propose a
random-walk Kalman filter that predicts the search
window, with the head position as its center and a
suitable range, on consecutive frames, and then
updates this prediction using the measured
position of the tracked head.

The Kalman filter is an optimal estimator. It
addresses the general problem of estimating
parameters of interest from indirect,
inaccurate, and uncertain measurements. Because
it is recursive, new measurement data can be fed
back to the system as they arrive, so it can be
used in real-time image processing systems.

The Kalman filter estimates a process by using a
form of feedback control: the filter estimates the
process state at some time and then obtains
feedback in the form of (noisy) measurements. As
such, the equations for the Kalman filter fall into
two groups: time update equations and
measurement update equations [4]. The time
update equations are responsible for projecting
forward (in time) the current state and error
covariance estimates to obtain the a priori
estimates for the next time step. The
measurement update equations are responsible
for the feedback, i.e., for incorporating a new
measurement into the a priori estimate to obtain
an improved a posteriori estimate. To adapt this
prediction method to our random tracking needs
we introduce new algorithms.

In our system, we construct the system model as
a random walk. The related equations are as
follows.

The state vector is $X_k = [x_k, y_k]^T$, where $x_k$, $y_k$
indicate the center position of the head in the $k$-th
frame image.

The measurement vector is $Z_k = [x_{zk}, y_{zk}]^T$,
where $x_{zk}$, $y_{zk}$ are the measured values
from our approach.

(1) $\dot{x} = u(t)$,

where $u(t)$ is unity Gaussian white noise, i.e., it has
zero mean and unity variance; this defines the
random walk [3].

(2) $z_k = H x_k + v_k$

From (1) and (2), we can construct the parameters of
the Kalman filter as follows:

Transition matrix

The initial a priori estimate error


0

Performance differs with the frame rate of the
image sequence. We captured image sequences at
two frame rates, 10 and 24 frames per second. With
the 24 fps image sequence there are no
problems. The following sample results are from a
10 fps image sequence. In this sequence, the
maximum head displacement between 2
consecutive frames is about 62 pixels. With
linear prediction, the center of the search
window in the next frame would fall off the
target, particularly on turnover motion, so a good
result cannot be obtained. However, we obtained
good results with our approach using the
random-walk Kalman filter. Figure 2 (a) and (b)
show our head tracking results using the
random-walk Kalman filter.
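
The predict/update cycle for a random-walk model can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the transition and measurement matrices are taken as identity (as a discrete random walk with direct position measurements implies), the noise levels `q` and `r` and the initial error `p0` are made-up values, and deriving the search range from the predicted covariance via `search_half_width` is an illustrative choice. The class name `RandomWalkKalman` is hypothetical.

```python
import numpy as np

class RandomWalkKalman:
    """2-D position tracker with state [x, y] and a random-walk process model."""

    def __init__(self, x0, p0=100.0, q=400.0, r=4.0):
        self.x = np.asarray(x0, dtype=np.float64)  # state estimate
        self.P = np.eye(2) * p0                    # estimate error covariance (assumed)
        self.Q = np.eye(2) * q                     # process noise covariance (assumed)
        self.R = np.eye(2) * r                     # measurement noise covariance (assumed)
        self.H = np.eye(2)                         # position is measured directly

    def predict(self):
        # Time update: the state is carried over unchanged, uncertainty grows by Q.
        self.P = self.P + self.Q
        return self.x.copy()

    def search_half_width(self, n_sigma=3.0):
        # Assumed rule: search range proportional to the predicted standard deviation.
        return n_sigma * np.sqrt(np.diag(self.P))

    def update(self, z):
        # Measurement update: blend the prediction with the measured head center.
        z = np.asarray(z, dtype=np.float64)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x.copy()

kf = RandomWalkKalman((100.0, 100.0))
center = kf.predict()
half_width = kf.search_half_width()
estimate = kf.update((162.0, 100.0))   # a 62-pixel jump in x, as in the 10 fps sequence
```

With these illustrative noise settings the predicted search range is wider than the 62-pixel jump, and the updated estimate lands close to the measurement rather than overshooting it.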


(a)  (b) 


Figure  2  :  Sample  results  from  a  10  fps  image 
sequence 

Figure 3 compares the x-coordinate of the head
position from the Kalman filter, from Birchfield's
method, and the true center. The real head
positions were recorded manually. There is a
difference of several pixels between the Kalman
filter and Birchfield's approach.

4.  Conclusion 

This paper presents a robust automatic visual
tracking module that uses a new Hybrid color
scheme with hue and saturation support and a
random-walk Kalman filter for predicting the head
position. From our test results, we conclude
that a proper mixture of RGB, chromatic
color, Hue, and Saturation gives the best results
for tracking the human face compared with other
currently available color schemes. Moreover,
when combined with the random-walk Kalman
filter, the resulting module should provide a
robust and reliable tracking method that
overcomes many current problems in predicting
the correct position of random and fast-moving
objects. The improvements in these two
modules show great promise for the
development of robust head tracking for ASR
and other computer vision applications.
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Table  1  :  Average  Distance  from  the  True  Center  (unit :  pixel) 


Color scheme        Seq. 1         Seq. 2         Seq. 3         Seq. 4         Avg.
                    X      Y       X      Y       X      Y       X      Y       X      Y
Hybrid (20+4+4)     5.49   8.49    7.44   5.98    5.31   9.9     5.19   11.46   5.86   8.96
Hybrid (20+8)       4.49   8.49    8.5    7.03    16.09  17.45   3.38   8.17    8.12   10.29
Stanford            16.99  -       8.72   11.49   3.52   10.08   3.4    7.82    8.16   9.75
Hue+Saturation      23.86  -       15.05  14.28   3.15   9.56    6.44   9.13    12.13  13.62
Hue                 33.56  20.29   13.7   12.31   9.3    10.88   7.65   10.18   16.05  13.42
Normalized          25.21  9.89    34.13  36.8    30.83  17.82   4.85   10.97   23.76  18.87

X : x direction tracking result
Y : y direction tracking result


Table 2 : Average Success Rate (Possibility to stay in the facial region through the whole
sequence)

Color scheme        Seq. 1 (40*)   Seq. 2 (65)    Seq. 3         Seq. 4 (101)   Avg. (%)
                    X      Y       X      Y       X      Y       X      Y       X      Y
Hybrid (20+4+4)     37     29      59     61      80     61      93     77      92.4   78.4
Hybrid (20+8)       39     33      51     59      59     40      101    97      85.9   78.7
Stanford            25     27      47     49      82     56      101    98      87.6   79.0
Hue+Saturation      14     18      36     35      81     62      91     81      76.3   67.4
Hue                 14     21      33     39      62     49      84     80      66.3   64.9
Normalized          19     27      20     17      46     30      94     85      61.5   54.6

Figure 1 : (a) Stanford (B-G)+(G-R)+(R+G+B/3) (b) Stanford + Hue(4) + Saturation(4)
(c) Hue + Saturation Color Scheme (d) Normalized Color


Figure 3 : Comparison of x-coordinates of head position with Kalman filter, Birchfield, and real
center position (manually recorded).
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The  Validation  of  Military  Callsign  Intelligibility 

Celestine  A.  Ntuen  &  Misty  Blue 
The  Institute  for  Human-Machine  Studies 
Department  of  Industrial  &  Systems  Engineering 
North  Carolina  A&T  State  University 

Abstract 

This study was conducted to evaluate human perception of computer-generated speech
under normal and stressful military environments. Performance intensity (PI)
functions for speech intelligibility were developed. The results are used to determine human speech
awareness thresholds (SAT) for quiet and noisy environments.


1.  INTRODUCTION 

Our ability to perform tasks
effectively in environments such as the
battlefield, airspace management (pilots and
air traffic controllers), hospitals, and
manufacturing systems depends in part on
our ability to process speech signals.
Effective speech communication requires
clear speaking by the talker, a nonrestrictive
transmission channel (medium), and good
hearing and speech comprehension by the
listener. These capabilities have been tested
using various speech material and trained
talkers (speech understanding tests) or
listeners (speech intelligibility tests).

One of the several methods to
measure our ability to process information
generated by sound or speech signals is
known as speech intelligibility (Logan,
Greene, & Pisoni, 1989).

Speech Intelligibility (SI) is an index for
measuring the minimum absolute threshold
of perceiving sound in a given environment.
SI is quantitatively defined as the percentage
of speech units that can be correctly
identified by a listener over a given
communication system in a given acoustic
environment, or the degree to which speech
can be understood under given conditions
(Letowski, Karsh, Vause, Shilling, Balias,
Brungart & McKinley, 2001). Intelligibility
tests evaluate the number of words or other
speech units that can be correctly identified
within a controlled situation. Some
examples of speech intelligibility tests are
documented in ISO (1986). The ones
relevant to this study are:


Diagnostic Rhyme Test (DRT): The DRT
uses a set of isolated words to test for
consonant intelligibility in initial position
(Goldstein, 1995; Logan, Greene & Pisoni,
1989). The test consists of 96 word pairs
that differ by a single acoustic feature in the
initial consonant. Word pairs are chosen to
evaluate the phonetic characteristics.

Modified Rhyme Test (MRT): The MRT is an
extension of the DRT that tests for both initial
and final consonant apprehension (Logan,
Greene & Pisoni, 1989). The test consists
of 50 sets of 6 one-syllable words that make
a total set of 300 words. The set of 6 words
is played one at a time, and the listener
marks which word he thinks he hears on a
multiple-choice answer sheet.

Diagnostic Medial Consonant Test (DMCT):
The DMCT is the same type of test as the
rhyme tests described above. The material
consists of 96 bi-syllable word pairs like
"stopper-stocker" which were selected to
differ only in their intervocalic consonant.

2. MILITARY CALLSIGN TEST (CAT)

The Auditory Research Team at the United
States Army Research Laboratory developed
the CAT test (Letowski, Karsh, Vause,
Shilling, Balias, Brungart, & McKinley,
2001). The CAT test uses military
callsigns as the calling phrase. A single callsign
for CAT consists of a word and a number:
a two-syllable military alphabet
code and a one-syllable number, for
example, alpha 1 or bravo 2. due to their
familiarity with test material and task
environments. To maintain its ecological
validity, it is important to test the CAT in
quiet conditions so as to establish a standard
and a reference SI metric for comparison
with other standard SI metrics (ISO 1986).
The test material seems to be a good
compromise between (1) the simplicity and poor
predictive value of monosyllabic signals and
(2) the complexity and memory load of
nonsense sentences and long number
sequences (Letowski, 2001).

The CAT test has been informally
used by the ARL-ART in several studies but
still lacks proper validation and
standardization. Such a process requires
several steps that need to be completed
before the final version of the test may be
released. One of these steps is the
standardization of SI and evaluation of the
related performance intensity (PI) curve for
CAT both in quiet and with background
noise.

3. PROCEDURE & METHODOLOGY

Participants

A group of 24 listeners between the
ages of 18 and 45 participated. All listeners
had pure-tone hearing thresholds better than
or equal to 20 dB HL at audiometric
frequencies from 250 Hz through 8000 Hz
(ANSI S3.6-1996) and no history of otologic
pathology. An audiometric screening test
was performed prior to participation in the
study.

Each listener was seated at the listener
station in a sound-treated test booth, using an
IBM PC/586 computer and wearing TDH-39
testing earphones. All the instructions were
displayed on the computer screen, and the
participant was able to use either the
computer mouse or the computer keyboard
for data input. The listener was asked to
listen to the series of CAT items (military
alphabet callsigns and one-syllable numbers
1-8) and identify them by pressing the
appropriate keys on the computer keyboard.
The main screen also showed the displayed
CAT test (Peak or RMS) and the
signal-to-noise ratio (SNR), given by -18 dB,
-12 dB, -8 dB, 0 dB, 6 dB, 12 dB.

The listeners repeated the test with the signal
level increasing in 5 dB steps until they
achieved 95% or better on both tests (RMS
and PEAK recordings). All the listeners'
responses were stored in a file and
subsequently imported into an Excel™
spreadsheet for analysis. Each listener
participated in a single listening session. The
session lasted about four hours and included
audiometric screening, instructions, testing,
and several 10-15 minute breaks.

4. SAMPLE RESULTS

The PI function showed some
characteristics of logistic distributions (see
example in Figure 2).

[Plot legend: Peak, RMS]

Figure 2: Sample logistic PI function for CAT intelligibility

$$\text{Score (Peak)} = \frac{1}{1 + e^{-0.78235 \cdot \mathrm{SNR}}}, \quad 0 < \mathrm{SNR} < 11.77; \quad R^2 = 90\% \quad (1)$$

$$\text{Score (RMS)} = \frac{1}{1 + e^{-0.745 \cdot \mathrm{SNR}}}, \quad 0 < \mathrm{SNR} < 12.36; \quad R^2 = 88.24\% \quad (2)$$
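
Since SI is defined in Section 1 as the percentage of speech units correctly identified, the score for one test run can be sketched as below. The function name `speech_intelligibility`, the `(word, number)` tuple representation of a callsign, and the sample responses are illustrative assumptions rather than details of the actual test software.

```python
def speech_intelligibility(presented, responses):
    """Percentage of presented callsigns that the listener identified correctly."""
    if not presented or len(presented) != len(responses):
        raise ValueError("presented and responses must be non-empty and of equal length")
    correct = sum(1 for p, r in zip(presented, responses) if p == r)
    return 100.0 * correct / len(presented)

# Hypothetical run at one SNR level: each item is (alphabet code, number).
presented = [("alpha", 1), ("bravo", 2), ("charlie", 5), ("delta", 8)]
responses = [("alpha", 1), ("bravo", 3), ("charlie", 5), ("delta", 8)]
si_score = speech_intelligibility(presented, responses)
reached_criterion = si_score >= 95.0   # the 95% stopping criterion from the procedure
```

Plotting such scores against SNR, one point per level, gives the kind of PI function shown in Figure 2.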

5.  CONCLUSION 


The logistic PI models show
that the speech awareness threshold (SAT)
occurs at a signal-to-noise ratio (SNR) >
0, with the average listener achieving an
SI value of 95% at SNR values of 11.64
for Peak and 12.22 for RMS. Using a
simple one-parameter linear model, the
speech awareness threshold occurs at
SNR values of approximately 2 for both
Peak and RMS tests, with the average
listener achieving an SI value of 95% at
SNR values between 7.7 and 7.9.
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Visual  Front-End 


IBM VVAV databases


LVCSR

- First-of-a-kind audiovisual
database for large-vocabulary
continuous SI speech
recognition (LVCSR)

- 290 subjects

- 70 hrs. continuous speech,
10,400 word vocabulary

Digits

- 50 subjects

- 8 46 hrs. continuous speech, 11
word vocabulary

Database Format

- Frontal face color video,
704x480, 30 Hz, MPEG2

- 16 kHz, 16-bit PCM

Experiments  on  Digits 


Fusion  Techniques 
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Results:  LVCSR 


Experiments  on  LVCSR 


IBM VVAV LVCSR database

- Training (261 spkrs, 35 hrs)

- Test (26 spkrs, 2.5 hrs), SI
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Conclusions 

Consistent  and  significant  gains  for  all  audio  conditions 
Significant  performance  gains  in  “speech-babble’’  noise 

-  Effective  gain  of  10  dB  @  10  dB  SNR  for  digits 

-  Effective  gains  of  7.5  dB  @  10  dB  for  LVCSR 

Significant  gains  in  relatively  clean  environments 

-  62%  relative  gain  for  digits  (0.75  ->  0.28) 

-  8%  for  LVCSR 

Super-human  performance  at  high-noise  levels 
Asynchrony  modeling  helps  for  digits 
Further  research  required  in  visually  challenging  domains 
Visual  adaptation  is  a  promising  approach 

- Up to 67% relative improvement in visual speech recognition
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Visual  Challenges 


Low  quality 


Illumination 




Acoustic  Challenges 


Sloppy  Speech 
Noise 

Reverberation 


Acoustic  Scene  Analysis 
Cross  Talk 
Distant  Mic 


From  Tracking  to  Modeling  Activit 


Tracking  Multiple  People 


Interactive  Systems  Labs 


Where? 


Face  Tracking  (Visual) 

Sound  Source  Localization  (Acoustic) 
People  Tracking  (Visual) 

Behavior  and  Movement  Models 


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Performance  Comparison 


Verbal  and  Non-Verbal  Information 


To  Whom  ? 

Focus  of  Attention  Tracking 

- R. Stiefelhagen, PUI'98, Humanoids'01, PhD Thesis'02

- Who is the addressee of an utterance?

- Who is someone talking to?

- What is a human user attending to?

Observation:

- FoA is a psychological state; it can only be inferred or
'guessed' from correlates

-  Both  Observed  User  and  Target  are  important: 

•  Pose,  Eye-Gaze 

•  Possible  Targets:  Noise,  Movement,  Faces,  Speech 



Focus  of  Attention  Trackin 




Conclusion 


Complete  Model  of  Human  Communication  is  Needed 

-  Include  all  modalities 

- Include not only what was said, but also:

who, where, to whom, how ..

Challenges: 

-  Robust  Processing  of  Component 

-  Proper  Level  and  Method  of  Fusion 

-  Robust  and  Dynamic  Fusion  of  Useful  Clues 


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Where  We  Started... 

[1993/1994] 


Joint Audio-Visual Speech Recognition
and  CMU  Audio-Visual  Speech  Data  Set 

Prof.  Tsuhan  Chen 

Carnegie  Mellon  University 


Thanks  to  Dr.  Simon  Lucey  and  Jie  Huang 


Lip-Reading 


Input  Video 


Face/Lip-Tracking


Feature 

Extraction 


Fusion 


Audio-Visual  Speech  Data  Set 


Thanks  to  Intel 
78  isolated  words  10  times 

* Date/time/month/day/etc.

* Audio: 44.8 kHz, 16 bits

* Video: 30/60Hz, 720x640


Lip parameters extracted
Noises

♦ Gaussian white/pink noise,
car, factory (NOISEX-92)

♦ Babble/crosstalk

♦ Lombard Effect
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recognition 


Result 


Weak  Lombard  Effect 


Strong  Lombard  Effect 
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Product  Rule  vs.  Sum  Rule 
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Beyond  Multimodal  ASR... 


Input  Video 


Face/Lip-Tracking


Adaptive

Filtering


[Plot; horizontal axis: Input SNR (dB); legend: w/o, adaptive smoothing]


Cleaned  Audio 
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Multimodal  Biometrics 


Data  Collection 


CMU  Multimodal  Biometrics  Database 

Face: 

■  30  subjects  with  300  images  each 

•  Image  size:  720*480 

•  Different  lighting  conditions,  with/without  glasses  and  ambient  lighting 

Right, Center, Left, Night shot (IP)

Fingerprint: 

•  Image  size:  192*128 

•  50  images  each  finger 

Right Index, Right Middle

Iris 

*  Iris  size:  about  400*400 

•  10  images  each  eye 


Mirror/wheel/panel/seat  adjustment 


Driver-Vehicle Interfaces
Cognitive  Overflow  Study 


Interview  Video 


Multimodal  User  Interfaces 


Fingerprint  and  Iris  Images 


[CMU-GM  Lab] 


Airbag  Deployment  Control 




Demo  Vehicle 


Demo  Vehicle 
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Advanced  Multimedia  Processing  Lab 

Please  visit  us  at: 

http://amp.ece.cmu.edu 

Or,  please  email  me  at 
tsuhan@cmu.edu 
